Tuesday, September 17, 2013

Probabilistic Grounding Models (I)

There were some questions in the comments about why we've been studying the readings so far, and also about how it relates to robotics.  As a result I've modified the schedule slightly, so that next week we jump into two papers about enabling robots to follow natural language commands.

The first paper describes an end-to-end system that my collaborators and I created for enabling robots to follow natural language commands about movement and manipulation.  The paper is currently in submission to a journal and not yet published, so please don't distribute it:
The second paper describes a different approach to following natural language commands by mapping between natural language and a symbolic language:

Post a short (200-word) response to the following question: Compare and contrast how these two papers represent word meanings.  What is being learned?  What is given to the system?  What tradeoffs are being made by the two different approaches?

19 comments:

  1. The two papers take very different approaches to learning to interpret natural language into robot action. While Tellex's approach has predefined actions, known word meanings, and an understanding of basic sentence structure, Chen's approach starts with none of these things. Tellex's system is built around a set of provided actions; it learns from a set of possible commands for each action so that, given a command, it knows how to act. With the help of its understanding of grammar and language, it can parse commands of a similar form and interpret them correctly. Chen, on the other hand, trains his system on {instruction, action sequence, state} tuples. His system has no prior language knowledge: it learns from the given tuples how an action sequence affects a given world, and which actions correlate with which instructions. He also uses lexicon learning to break instructions and their related actions into related sub-parts, allowing the system to understand commands it has not seen before.
    The major trade-off between the two systems is the difference in priors. Tellex’s system focuses on a single language with set semantic rules and a set of allowed actions, while Chen’s allows for any language with any actions resulting in any changes. Tellex’s system sacrifices portability, while Chen’s sacrifices some understanding capability by not having language-parsing priors, especially for words whose meanings are context-dependent.

  2. In the G^3 framework, words are related to aspects of the external world (objects, places, paths, or events), i.e., groundings. The robot learns to map from the linguistic constituents of a parsed command to their corresponding groundings. During training, the system is given a parallel corpus of natural language commands paired with the corresponding robot actions and environment state sequences. The system makes certain independence assumptions in order to factor the joint distribution and make computation tractable.
    In Chen and Mooney's system, the robot learns to produce correct actions from a pair of a natural language instruction and a description of the external world. The system is given training data in the form of triplets containing a language instruction, a sequence of actions, and a world description. While the G^3 framework treats each instruction as a tree structure, Chen and Mooney's system treats instructions as flat sequences, which requires no prior linguistic knowledge but, as noted in the G^3 paper, may cause problems with variable arguments or nested clauses.
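    The factoring idea mentioned above can be sketched in a few lines. This is only a toy illustration of a factored correspondence model under an independence assumption; the feature names, weights, and example command are all invented, not taken from the paper, and the real model's features and normalization differ.

```python
import math

# One factor per linguistic constituent; assuming the factors are
# independent, the joint log-likelihood is a sum of per-factor terms.
# Feature names, weights, and the example command are made up.

def factor_score(weights, features):
    """Logistic score for p(correspondence = True | constituent, grounding)."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def joint_log_score(weights, factors):
    """Sum of per-factor log-likelihoods under the independence assumption."""
    return sum(math.log(factor_score(weights, f)) for f in factors)

# Hypothetical command "go to the pallet", one factor per constituent:
weights = {"verb=go:event=move": 1.2, "np=pallet:object=pallet_3": 2.0}
factors = [
    {"verb=go:event=move": 1.0},         # verb phrase -> event grounding
    {"np=pallet:object=pallet_3": 1.0},  # noun phrase -> object grounding
]
best = joint_log_score(weights, factors)
```

    Because the joint factors into independent terms, each constituent's grounding can be scored (and searched over) separately, which is exactly what makes inference tractable.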

  3. Chen and Mooney describe a system which is given a natural language command and observes a human demonstrating a navigation task, without any assumptions about word meanings, semantics, or syntactic structure. The observations of the demonstration are then mapped to the command. Kollar et al., by contrast, describe a system that assumes certain parts of speech map to specific parts of the world (e.g., nouns to objects, verbs to events). Their system learns from a collection of commands generated for a demonstrated task.

    With regard to the language, Chen and Mooney's system probably relies heavily on a few specific words from which to gather the meaning of a command, so it is likely to stay confined to navigation. The approach of Kollar et al., on the other hand, is more likely to associate meaning with more nuanced words, especially prepositional phrases, adjectives, and adverbs, which allows it to work in mobile manipulation as well. However, Chen and Mooney's approach is more likely to allow a robot to be trained in a completely new environment and domain without significant reprogramming.

  4. Kollar and Tellex’s system represents word meanings as weights on features in factor graphs; for mobile manipulation, binary features are used. Chen’s system divides words into two categories, actions and objects/locations, mapping verb phrases to action plans and noun phrases to objects/locations.

    Given instructions and their positive and negative grounding examples, Kollar and Tellex’s system learns to maximize the objective function, which factors over feature values. Given English instructions and the corresponding human actions, Chen’s system learns to map the instructions to navigation plans, which are executable by MARCO.

    Kollar and Tellex define a set of features that their system needs to know beforehand because their system operates in a continuous world. A set of predefined features enables it to achieve good performance, but to transfer the system to other domains, experts must provide a new set of predefined features.

    Chen’s system learns words and their meanings directly from input examples in a discrete world. It performs worse than systems that are given predefined knowledge of the world, but since it learns everything from training data, it could hypothetically generalize to other discrete domains.

  5. The G3 framework is given natural language commands with positive and negative examples of groundings for each constituent in the command. It infers the type of each grounding variable by mapping noun phrases to objects, prepositional phrases to places, and verb phrases to events. G3 defines a probabilistic graphical model dynamically and can understand novel commands not present in the training set. The two systems parse words into different structures, and the search space in G3 grows rapidly as the number of groundings (objects, paths, and events) increases. Chen's system, by contrast, has to learn the meaning of every word; it is given natural language instructions, observed action sequences, and descriptions of the current state. Chen's system can handle different languages and different actions, but it needs to keep the navigation plans relatively clean for KRISP to learn effectively (otherwise its accuracy is reduced).

  6. On a very high level, the Chen approach reminds me of behaviorist psychology, while the G3 model reminds me of the cognitive psychological approach (and their respective advantages and disadvantages correspond as well). That is to say, Chen is effectively doing pattern recognition with SVMs to match input phrases to output action plans. The G3 model, however, tries to create an internal representation of objects (grounding) and chain them together in noun-verb-preposition combinations. Indeed, the G3 model explicitly assumes this breakdown, and assumes independence between the parts to make the computation tractable. The Chen model doesn't assume linguistic structure; as a consequence, it is more flexible but less robust, and vice versa for the G3 approach.

    Another important consideration is language generation. It is helpful for a robot to be able to ask for help when it is confused. The G3 model grounds nouns, actions, and prepositions all separately, leading to a natural model for language generation. As Chen and Mooney note, their supervised data could certainly be used for a language generator, but their work doesn't directly lend itself to one. The G3 model provides more in this regard.

  7. The G3 framework (Kollar et al.) and refined landmark plans (Chen et al.) both try to plan and perform a set of actions given a natural language command.
    The G3 approach learns a set of bindings to objects, events, places, etc. from a training corpus, then predicts bindings and performs the corresponding actions given a command. The refined landmark plan approach gets the best plan for the given command from the landmark corpus and simplifies it so that the plan retains only the most basic set of landmarks. This assumption simplifies the problem and avoids dealing with the complex syntactic structures that can come up in human speech.
    The trade-off is that the G3 system is more complicated than the refined landmark system: it takes the syntax in as a whole and finds the best sequence of actions by maximizing the joint probability of the bindings. The refined landmark approach is simpler, but I assume this restricts it to flat sentence structures. It removes many landmarks during refinement only because keeping them would complicate planning, but a lot of useful verification might be lost there.

  8. Chen and Mooney's approach represents meanings as plan components that correlate with certain phrases. G^3 represents meaning through probabilities of linguistic units being associated with certain objects, actions, and locations in the world. The G^3 system is more flexible and can represent different kinds of meanings, unlike the Chen and Mooney approach, which is limited to instructions. The G^3 approach requires a corpus of commands with examples of correct and incorrect groundings, but it also has a built-in understanding of the parts of speech and how they map to things in the world. Chen and Mooney simply require labeled examples of actions, with no prior knowledge. The primary tradeoff is that Chen and Mooney's system is simpler and requires less prior understanding of language, but it has a limited domain with little use for nouns or objects in the world. The G^3 system is more complex and makes certain assumptions about language, but it provides a wider framework for thinking about meaning because it works with a wider variety of parts of speech and situations. More generally, it seems we'll have to accept that some built-in prior knowledge of how language works is required to make the problem of grounding tractable.

  9. In the G3 framework, there is a mapping between words in natural language and specific types such as objects and actions. The system can parse a natural language command into words or phrases and then find their corresponding meanings using this predefined "dictionary." It also uses training data to help the robot learn which actions are associated with which kinds of commands, so when a trained or similar command comes in, the robot knows what to perform. Chen and Mooney use data in the form of {instruction, action sequence, description of the world} triples to train their system, which requires no predefined knowledge.
    I think the G3 framework concentrates on specific aspects while Chen's method concentrates on general ones. That is to say, the G3 framework is capable of executing sophisticated commands within the aspects it was trained on, while Chen's system is better at performing actions across a broad area.

  10. G3 models word meaning through the weights of a two-way logistic regressor that learns the goodness of fit between linguistic features of a multiword expression and visual features of an ordered set of entities. G3 is provided with a statistical parser, well-engineered features about the real world, an object detector, and manually annotated training data.

    Chen/Mooney's system models word meaning with a system called KRISP, described briefly in the paper as a string-kernel SVM with a structured output of the plan, capturing the intuition that phrases that are spelled similarly lead to similar commands. C/M is provided with only instruction/action/world tuples.

    Both systems can handle novel words - in C/M a novel phrase will associate with low-edit-distance support vectors, while in G3 I assume there are some linguistic features that operate at the POS-tag/morphological level.

    One big difference is that C/M is learning the correspondence between words and commands directly while G3 is learning how to build an abstract representation (the grounding graph) that can be used to plan commands. This is a tradeoff because for simple domains with a very limited set of actions like the one tackled in C/M, training data can be automatically obtained from action/instruction pairs as shown in their paper. G3 requires annotated grounding graphs where entities in the world are connected with parse tree constituents to serve as positive examples.
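    The "spelled similarly, behaves similarly" intuition can be sketched with a toy shared-n-gram similarity. To be clear, KRISP's actual kernel is a string subsequence kernel inside an SVM; this only conveys the flavor, and the phrases are made up.

```python
# Toy string kernel: similarity = number of shared character trigrams,
# so phrases that are spelled similarly score higher against each other.
# Illustrative only; KRISP uses a subsequence kernel inside an SVM.

def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def string_kernel(a, b, n=3):
    """Count of character n-grams the two phrases share."""
    return len(ngrams(a, n) & ngrams(b, n))

# A novel phrase associates most strongly with the most similar
# training phrase, echoing the support-vector intuition above.
train = ["turn left", "turn right", "go forward"]
novel = "turn slightly left"
closest = max(train, key=lambda t: string_kernel(novel, t))
```

    Here the unseen phrase "turn slightly left" lands nearest "turn left", which is roughly how a string-kernel classifier can generalize to wordings it never saw in training.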

  11. Superficially, both language systems use a graph/hierarchical structure to represent the meanings of words. In the Kollar/Tellex paper, a structure called a Generalized Grounding Graph is used to extend the idea of a parse tree to physically ground meanings in the real world. The G3 approach enumerates several types of grounding: a word/phrase can represent a physical object, a location in the world space, a path through the world space, or a series of actions to be performed. The interpretation is expressed as a factorization of the original command into a series of such groundings. The Chen/Mooney approach is similar in its focus on developing a plan of action based upon the input statement (and is daringly ambitious in its attempt to begin with NO a priori knowledge of language or commands), but differs in that a statement is directly interpreted as mapping to a series of actions in the real world. The focus of this approach is exactly translating the input statement into a series of actions and verification steps. In both approaches, a probability distribution over phrases and their groundings is used to determine the most likely interpretation of any phrase.

  12. Kollar and Tellex's approach uses Generalized Grounding Graphs to learn the relationships between parts of sentences and groundings. They introduce a correspondence variable which captures the dependency between grounding variables and language; word meanings are then represented as weights associated with feature functions. The corpus used for training is manually annotated, mapping constituents of a command such as nouns, prepositions, or verbs to objects, places, paths, or action sequences. Given a command, the system finds the most probable groundings by maximizing the probability under the feature functions.

    While Kollar and Tellex's paper employs linguistic knowledge such as sentence structure, Chen's approach is more direct. In his paper, no prior linguistic knowledge (syntactic, semantic, or lexical) is used; the system simply uses observation data to do all of the understanding. A base navigation plan is constructed from the observation data, and a refined plan is generated using lexicon learning. Pairs of plans (or refined plans) and instructions are the input to the semantic parser learner, which learns a set of string classifiers that decide how to construct meaning representations.

    The tradeoffs between the two approaches lie in the prior linguistic knowledge used and the learning precision achieved. With the use of linguistic and other knowledge, Kollar and Tellex's approach achieves high precision. In Chen's paper, the system uses the intrinsic meaning of the domain language as input to the learning system; here the intrinsic meaning is the plan for a natural language route instruction.

  13. Both of these articles discuss "grounding," but they use different ways to let the robot get word meanings. In the first article, the authors created the G3 framework to help the robot understand words. Following the compositional and hierarchical structure of a natural language command, G3 defines the grounding graph model dynamically. The authors used binary features for each factor; these features help the robot decide which value is right for each linguistic constituent, and geometric features help it learn the relationships between objects, places, paths, and events. By matching each factor with a nearby object or a particular location, the machine can get the meaning of the words. According to the first article, this approach enables the robot to understand word meanings that do not appear in the training set, which makes it seem more intelligent. In the second article, however, the authors point out a new way: they created a novel system that lets the robot learn semantic parsers just by observing human actions, without using any linguistic knowledge. The system takes observed actions and uses them to infer a formal navigation plan for each instruction. This research shows that a robot can learn semantic parsers by following human action without being given linguistic knowledge.

  14. The surface goal of both papers is for a robot to be able to follow natural language commands. The Mooney and Chen article focuses solely on directions from one position to another, while the Tellex paper includes other applications such as forklift operation and wheelchair assistance. In Mooney and Chen's approach, the system starts with no linguistic information (nothing about syntax or semantics) and is given pairs of a natural language command e and a formal navigation plan p (constructed from an action sequence a), which it uses as a training corpus. The Tellex system starts out with more linguistic knowledge, notably the parse structure of the commands. The advantage is that different pieces of the sentence, e.g. prepositions and noun phrases, can be analyzed separately; the tradeoff is that it initially requires more supervision in training.

  15. Chen & Mooney's approach is very simple, but a relatively crude model for language directions: it uses correlations between phrases and words in a large corpus to figure out what a user might mean. This treats the language data as almost a bag of words, inevitably throwing out much of the structure. Their results (on a very limited domain) are pretty good nonetheless, and it's certainly interesting to see that such a direct approach works to some extent.

    In G^3, sentence structure is extracted and leveraged for independence relations in a graphical model, which makes inference over all the objects more tractable. The result is object groundings, but additionally the structures and meanings of the words, which can later be converted into a plan. It seems slightly more flexible and more likely to be robust to variations in input, at the cost of much higher overhead in parsing the input, etc. The fact that it extracts structure would be especially important in domains where there's a lot of variation and scope to commands (most of them), as learning direct correspondences would require more and more data.
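    The word/action correlation idea can be made concrete with a toy co-occurrence counter. This is a deliberately crude sketch with invented training pairs; the real system learns structured navigation plans, not single action labels.

```python
# Toy "bag of words" correlation model: count how often each word
# co-occurs with each demonstrated action, then let the words of a new
# instruction vote for the action they correlate with most.
# Training pairs are invented; the real system outputs structured plans.

from collections import Counter, defaultdict

def train(pairs):
    counts = defaultdict(Counter)  # word -> Counter of actions
    for instruction, action in pairs:
        for word in instruction.lower().split():
            counts[word][action] += 1
    return counts

def predict(counts, instruction):
    votes = Counter()
    for word in instruction.lower().split():
        votes.update(counts.get(word, Counter()))
    return votes.most_common(1)[0][0] if votes else None

pairs = [("turn left at the chair", "LEFT"),
         ("go left past the sofa", "LEFT"),
         ("turn right at the lamp", "RIGHT")]
model = train(pairs)
```

    Even this crude counter handles some rewordings, but because it ignores word order and structure, it needs more and more data as commands grow more varied, which is exactly the limitation described above.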

  16. The most significant difference between the two is that G3 makes direct relations from a parse tree to grounded meanings and forms probabilistic relationships between the two, whereas Chen & Mooney's approach binds n-grams to navigation plans and later uses a semantic parser called KRISP to learn to parse n-grams and deduce the linked navigation plans.
    Since G3 makes more extensive use of language structure in this mapping phase, it can more easily extend to mobile manipulation, understanding not only verb and noun phrases but also prepositional phrases, allowing robots to perform more sophisticated manipulation actions. Chen & Mooney's approach lacks this part, and it is much harder to extend their model to more complex verb structures unless they adopt a similar model.
    However, both approaches make similar assumptions for navigational tasks. G3 also makes trade-offs by ignoring noun phrase variations with prepositions; it uses web-based search approaches to robustly infer a location from landmarks. Chen & Mooney's approach uses refined landmark plans so that only relevant landmarks are parsed to regenerate plans.
    Both approaches have limitations if the bank of grounded language is not limited to a specific context. However, compared to past symbolic models, they handle variability in natural language robustly enough that humans can give commands with ease. G3 also showed direct applications to real-world robotics, which is the major strength of its probabilistic approach.

  17. The G3 paper uses a grounding graph to provide a mapping between words and grounded objects/prepositions/verbs. The structure of this graph, topologically, allows the robot to infer the relationships between different symbols. The binary features associated with nodes give symbolic and actionable meaning to the phrase the robot has been given. In addition, the robot primarily attempts to ground symbols in objects that are found in its vicinity, making the problem a much more tractable one.

    On the other hand, Chen & Mooney's approach uses no prior linguistic knowledge. Instead, they have the system observe a series of states along with descriptions of how the states are changing. In my opinion, this makes the problem of symbol grounding more of an unsupervised one: the state descriptions could be in any language (even an invented one) and the robot should still be able to ground the words in some way.

  18. Chen and Mooney's system learns to infer navigation plans by observing (training on) the action sequences of humans following natural language instructions (that correspond to navigation plans). Their system is given essentially no prior knowledge about language – the goal is to learn navigation plans exclusively through pattern recognition between an instruction and previously (and properly) executed actions. G^3, on the other hand, seeks to ground 'linguistic constituents' (objects, places, paths, and events) to entities in the real world in order to improve the efficacy of robots following natural language commands across broad domains. The higher-level goals are essentially the same, but the methodologies (and desired scope) are quite different. Additionally, G^3 is given a bit of parse knowledge to aid its grounding mechanism, allocating separate grounding mechanisms to different parts of speech (nouns → objects, verbs → events, etc.). The tradeoff here is between how much linguistic knowledge the robot is given and the flexibility and efficacy of the system. The G^3 system, as demonstrated in section 4, is useful across several robotic domains (forklift, wheelchair, PR2), while Chen and Mooney's system is restricted to the specific domain of navigation (which, admittedly, is quite a general problem).

  19. The G3 algorithm addresses references to the robot's physical surroundings as described by people, called "grounded language". It takes in:
    (1) 3D shape words describing the environment,
    (2) a sequence of trajectory points describing movement direction,
    (3) a set of predefined textual tags to help the robot model the environment.
    The G3 algorithm then builds a graph with probabilities to model language structure.
    Chen and Mooney's algorithm investigates a system that can take natural-language navigation instructions and then move to the destination based on observations. The system takes in training data which contains
    (1) natural language instructions,
    (2) an observed action sequence (the system observes how humans react to the instructions),
    (3) the current environment state.
    Comparing the two algorithms, we can see that G3 is more structured, with its graph modeling word meanings; the robot can infer a word's meaning from location and context. Chen and Mooney's algorithm requires no prior linguistic knowledge and can use free-form natural language, but since language form is left out, the accuracy of its word understanding may suffer.
