Friday, November 8, 2013

Generating Communicative Actions

This week we will read papers focused on generating communicative actions.  The first one is currently in submission at a conference, and the other two were published last year at RSS and HRI by Sidd Srinivasa's group at CMU.  There is a deep mathematical connection between the two approaches, which we will discuss in class.

11/12/13 (Tues.)  Generating Language
11/14/13 (Thurs.) Generating Motion

For Tuesday, please post on the blog your suggestions for improving the Tellex et al. paper. I'm looking for about 200 words. This paper is not yet in its final version, so your comments will help improve it. It will also give you practice critically evaluating research, which you can apply to your own work as you finish up your final projects.

13 comments:

  1. I was confused for a while by the equations in Section 4.3. The part that confused me was the max(Gamma given gamma_a = gamma_a^*). I think this means the best/most likely complete grounding given that the action grounding variable is fixed to the desired action. I think I just confused myself by initially interpreting this as argmax(gamma_a^*), which obviously doesn't make sense, since you are then considering every possible action grounding. Clarifying this might be beneficial (I write out my reading at the end of this comment), but if I'm the only one who had this problem, then it isn't really a big deal.
    I think the step-by-step derivation in 4.4 was great. Showing the results of chance and hand-written requests from AMT was also really useful in giving both an upper and lower baseline.
    I would be curious to see which properties the hand-written requests had that the S2 requests lacked, resulting in the 30% difference in success rates in the AMT tests. It might be interesting to include some qualitative differences between the two to highlight why the gap arises and how one might bridge it.
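    For what it's worth, here is my reading written out (my interpretation of the notation, not necessarily the paper's intent): the quantity being maximized is something like
        \max_{\Gamma \,:\, \gamma_a = \gamma_a^*} \; p(\Gamma \mid \Lambda)
    i.e., the most likely complete grounding among only those groundings whose action variable equals the desired action, rather than a maximization over the action grounding itself.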

  2. First of all, the discrepancy between the results of the corpus-based evaluation and the user study comes from the fact that in the corpus-based evaluation subjects were not aware of the history of actions, whereas in the user study they were. This is why the "Help me" baseline condition was more of a random-choice condition in the corpus-based case and scored fifty percent in the user study, which is close to the numbers for the Template baseline and for G3 with S1.
    As pointed out in the conclusion, nonverbal cues play an important role in helping human subjects infer what the robot wants and what actions they should provide. When humans perform a hand-off, they exchange many kinds of nonverbal cues (such as gaze and hand gestures) to inform the other person which hand to give the object to and when to give it.
    Thinking about Baxter, we can imagine a robot with a face display to show its gaze, or one that makes a simple hand-off gesture (shaking its hand) before the hand-off.
    Overall, the entire task has to be tolerant of errors in order to minimize the experimenter's interventions, and the robot has to be more explicit in asking its questions, both nonverbally and verbally.

  3. In general I thought the paper was very clear. I appreciated especially the
    care you all took in placing the work in context, and in the detailed derivations.

    Some things I found confusing (not all of which necessarily call for changes),
    in order of appearance:

    Why is G^3 an appropriate model for human language understanding? (I imagine
    it's pretty good from a behavioral standpoint, but if you've thought about
    this it would be interesting to explain more.)

    3.1 What are the implications of errors in the "actual state" (e.g. VICON
    failures)?

    3.2 I think an example of the symbolic representation would make it easier
    to think about.

    4 I was confused for a while about the terminology surrounding capital lambda. I
    concluded that it was a sentence, but you also referred to it as "words" at the
    beginning of 4. I think calling it a sentence throughout might make it clearer?

    4.1 What exactly is a constituent?

    4.3
    - I could not wrap my head around gamma_a*
    - I'm also not familiar with the notation used in eq. 6 under the max (what
    is the bar?)

    5.2 It was not initially clear to me that you did two distinct evaluations
    (maybe just emphasize this with an opening transition, "We also did a user study...")

    Fig.8 The second and third sentences don't quite parse for me. Maybe mark with
    [sic]?

  4. One problem I can think of is that it would be better to develop a dialog system so the robot can clarify its request by answering people's questions in real time; this would provide disambiguation and help the request get carried out when people cannot understand it. From what we have read recently, I think some existing dialog systems are powerful enough to be incorporated into this design, since the overall design of the system is good and the ambiguity is not that severe. Perhaps the dialog system is only partially within the scope of this paper, and that may be the reason it does not implement one, but the performance of the system would definitely improve with a dialog system.

    My second question for this paper is whether the robot can update its system after asking for people's help, and whether it will fail again for the same reason. Although the design of the system enables the robot to recover from failures by asking people for help, the paper does not specify (or I did not notice) how this recovery process is used in the future (e.g., will it update the system's data so that the robot will not fail in the same situation again, or are some failures simply unavoidable?). If not, the person may feel impatient if the robot continually asks for the same help.

  5. In the user-study part of the evaluation, the paper compares the S0 baseline approach with the S2 inverse semantics approach in a real IKEA furniture assembly environment, and the S2 approach is indeed more efficient and accurate than the S0 baseline. But I'm wondering why it did not also compare the template baseline with the S2 approach. Intuitively, the template baseline would take less time than S0 and is close to the inverse semantics approach, so the overall success rates would be closer. Most importantly, the template approach is easy to implement and can be made quite expressive: it only needs a lookup table of requests and could even be integrated into the symbolic planner. And by drawing on the dataset of natural language requests gathered from AMT, it could use some of them to improve its expressiveness.
    Also, the article mentions that there were 50 hand-off requests among the 102 help requests during the study, and that subjects found it difficult to successfully hand a part to the robot despite initial training. So I'm wondering whether this hand-off is a 'big' action; in other words, could we divide it into several smaller actions to make it easier for people to perform?

  6. Firstly, I found the paper to be generally clear and well-organized. Most of the issues I had in understanding the content involved mathematical formalism with which I am unfamiliar, but in general I found that this paper spent more time than most on derivation, which was much appreciated.

    One piece of notation I did not understand was the more complicated max operator used in 4.3, which has what appears to be a conditional statement beneath it.

    In addition, I did not really understand the distinction between gamma_a and gamma_a^*.

    I thought perhaps some of the data in the results/discussion sections could be presented in a more easily consumable manner (there is a nice table at the beginning of Section 5 showing the fraction of correctly followed help requests under the various approaches to speech generation; after that, the results are mostly large walls of text). As lame as I feel saying something like "more charts and graphs," it seems important to me as a reader to be able to see clearly that the methodology presented in the paper performs to the standards set by the researchers.

    Finally, I'm not sure if this is within the scope of the paper, but if I were unfamiliar with the G3 system going in, the overview presented in the paper would not be sufficient for me to pick it up or make inferences from it. (This could also be said about much of the background mathematics... not sure what to do with this one.)

  7. Chrome snarfed my first draft, so I’m skipping the generally polite comments about the strengths of the paper. In short, I think it’s very clear and describes a clever contribution.

    It's not clear to me why S1 doesn't devolve into generating extremely general help requests like "help me". If it always chooses the request most likely to be correct, how does it end up generating something worthwhile? I would like more clarification on why this wasn't occurring, and on whether they had to limit the vocabulary in order to prevent it.

    It appeared that the plan the robots followed required specific table legs to be inserted in specific corners of the table. However, with most IKEA pieces there are always several duplicate items. I would like more detail on how the system could handle object invariance, where it doesn't care which item of a category of items is chosen. It seemed that one of the key reasons S1 performed much worse than S2 was that by asking only for "the white leg" it implied the system didn't care which leg was used. I believe a model combining S1 and S2 would perform best when object specificity can be relaxed; the requests would need to be more general to convey this non-specificity. (I sketch the distinction I have in mind at the end of this comment.)

    I think one of the frustrations of working with systems that generate vague help requests, like "help me", is that they result in numerous randomly chosen actions on the part of the human; and indeed, the authors do make this claim. However, if they could show the number of actions tried per help request, it would make a strong case that more informative requests improve the interaction.

    I found the photos used for the AMT evaluations to be a bit dark. I would suggest adjusting the levels of these images in an image processing program.
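
    To spell out the S1/S2 distinction I was assuming above (this is my reading of Section 4, not the paper's exact notation): S1 seems to score a candidate request by how well its words fit the desired action, roughly
        \Lambda^* = \arg\max_{\Lambda} \; \mathrm{score}(\Lambda, \gamma_a^*)
    while S2 scores a candidate request by how likely the listener is to recover the desired action from it,
        \Lambda^* = \arg\max_{\Lambda} \; p(\gamma_a = \gamma_a^* \mid \Lambda).
    The relaxed-specificity combination I am imagining would replace the single desired action gamma_a^* in the second objective with the set of acceptable actions.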

  8. The paper presents methods for a robot to generate natural language requests for help when it is stuck on a given task.
    I like the paper for the simplicity with which it explains its ideas and motivations. The setup is explained clearly enough, as are the experiment and results sections. What I do not like is the development of the math. I would like more detail, especially an explanation in words of what each symbol stands for; the distinction between gamma_a and gamma_a^* was a source of confusion. Section 4.4 on minimization of user uncertainty has a lot of math, and it would help if the meaning of the conditional probabilities were explained in that section before they are used. The minimization trick there is still unclear to me, and it is something I would like to come back to when I reread the paper.

  9. Overall I really liked this paper. The evaluations in particular were very thorough, and each iteration of the algorithm (s_0, s_1, s_2) was explained clearly and succinctly.

    However, it's not clear to me that there is a problem to solve. In section 4 you mention that the contribution of the paper is a definition for *h* using inverse semantics. Why is this methodology of speech generation any better than the related models you present in section 2? Why is there a *need* for using inverse semantics? The evaluation section is convincing, but why are these results a win? I think what I'm hoping for (specifically) is an explanation of why other systems (beyond the baseline Template system, perhaps those mentioned in related work) cannot achieve what your system achieves.

    After rereading the related work section, I noticed that you list some advantages to the inverse semantic model, but I suppose making this clearer might benefit a few (skeptical) readers.

    That was my main substantial suggestion. The other would be to add my voice to Isaak's; if I hadn't seen G^3 before, I think that any part of the formalism involving the G^3 framework would have been lost on me.

  10. The paper is a lot clearer than many others we've read. I like that it shows the steps that derive the main equations rather than dropping them with no explanation. The summary of G^3 was particularly helpful and conveyed a lot of useful background information in a short amount of text.

    I felt like the section describing how the requests are formed could offer some more detail on how it would be implemented, so the reader can understand what's going on. For S_1 it says that the robot will "search for language that best matches the action the robot would like the human to take" and then drops an equation, but I'm not sure I see exactly how the equation would become an algorithm for finding a particular sentence. I can see that too many implementation details would be extraneous to demonstrating the models and results of the paper, but I feel like a quick reference to how the equation is used might make it clearer. Similarly, for S_2 we're given a mathematical model that makes sense, but I don't see exactly how I would use it.
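
    To illustrate the kind of quick reference I have in mind, here is a rough sketch of how I imagine such an equation becoming a search over candidate sentences (purely illustrative; the function and variable names below are my own placeholders, not anything defined in the paper):

        # Hypothetical sketch: enumerate candidate requests and keep the one
        # that scores best under the model for the desired action grounding.
        # 'candidate_requests' and 'model.score' are placeholders, not the
        # paper's actual interfaces.
        def choose_request(candidate_requests, desired_action, model):
            best_request, best_score = None, float("-inf")
            for request in candidate_requests:
                score = model.score(request, desired_action)
                if score > best_score:
                    best_request, best_score = request, score
            return best_request

    Even a sentence or two along these lines (how the candidate sentences are enumerated, and what exactly gets scored for each) would answer my question.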

  11. Overall I thought this paper was great. It's a cool algorithm, with good results, and a satisfying extension of the G3 framework. There were two points where I thought the paper could be improved. First, without background knowledge, G3 would be very confusing. I think it's probably sufficient to leave it to another paper, but there is very little description of its terms within this paper itself. A more interesting and important criticism: it would be great to have examples of failure cases. I find it somewhat surprising that hand-written requests would succeed 94% of the time rather than 100%; I'd imagine humans would be very, very, very good at this task. More interestingly, what are the cases where the S2 and S1 functions create failures? Were there cases where the human had the right idea, but the 60-second timer ran out because of a hand-off failure or other issues? A third (minor) point: it might be tangential to the purposes of this paper, but I always like connections to human cognition. Is there evidence that humans have fairly complex models of the extent to which collaborators understand a proposed sentence?

  12. In general, I think this paper is easy to understand and your goal is very clear.
    My first suggestion concerns the video you showed us in today's lecture: when the robot asks people for help, sometimes people feel very confused by the robot's request, but they cannot communicate further with the robot. In this case, people have to guess the robot's meaning. So I think you could improve the robot's abilities a little so that it can have a simple exchange with people; in that case the probability of success would be much higher.

    Another part that confuses me: if people hand the wrong piece of furniture to the robot, how can they know? Maybe this question is outside your research topic, but I feel that if the robot could let people know when it has received the wrong piece, it would be easier for them to grab the right piece later on.

  13. I think one thing that could have been done is to include robot requests at both ends of the detail spectrum. For example, there was the extremely vague "help me!" but nothing at the opposite extreme. Given a range of possible requests, it might then be possible to determine the best one from a combination of efficacy and efficiency.
    It was also unclear to me how the robot detects failures in the response to a request. How long do humans have before it considers their response a failure? Maybe I just missed this, but it didn't seem well described what would happen in this case.
