Tuesday, September 24, 2013

Probabilistic Grounding Models (II)

This week we will continue our investigation into probabilistic grounding models.  We will read a paper about a language understanding system developed in collaboration between computational linguists and roboticists at the University of Washington.
This paper was written after the Chen and Mooney paper (2011) and the Tellex et al. paper.  (Even though the Tellex et al. paper hasn't come out yet, the original contributions appeared in conferences in 2010 and 2011.)  In some ways it unifies ideas from the two approaches:  it jointly learns a semantic parsing model as well as attribute classifiers for colors and shapes. 

Please post on the blog by 5pm on Wednesday a roughly 500 word answer to the following question:
  • Compare and contrast how this paper represents word meanings with the previous week's readings (Chen and Mooney, and Tellex et al.). What is being learned? What is given to the system? What tradeoffs are being made by this approach, compared to the other two?

20 comments:

  1. (1) The system learns a set of visual classifiers, and a semantic parser that effectively encodes representations of word meanings corresponding to the learned visual classifiers. The primary advantage here is that the system employs both visual classification and semantic parsing in an integrated way, so that language meaning may be bound to perception. As a result, it is capable of inducing novel attributes for objects, as demonstrated in Section 8. One critical difference in this paper is that its overarching goal is to learn new attributes for objects, while G^3 set out to ground language to specific objects (and ultimately, to elicit action based on groundings), and Chen & Mooney sought to turn natural language into a formal, symbolic, imperative language that generated plans mapping to the content of the natural language. These goals differ in important ways, though they are fundamentally variations on the same problem.
    (2) The system is given an algorithm (FUBL) for learning factored CCGs from natural language. Additionally, the system is given training data that includes z_i and w_i tags (linguistic uncertainty and perceptual uncertainty, respectively; Section 5) for each datum in the data set. The data consists of sets of 'scenes', where each 'scene' contains a natural language sentence, an image, and “indications of what objects are being referred to” (Section …); a rough sketch of one such datum appears after point (3).
    (3) One primary tradeoff is that the domain of possible features is limited to color and shape. Thus, novel categories within that domain can be induced, but the system is restricted by what the visual classifier is equipped to classify (i.e., what populates the set C). It is not clear whether it could be extended to handle non-visual object attributes, such as texture or weight, though I suspect there might be some difficulty in achieving this, given how tightly bound the visual classification is with learning word meaning. Perhaps the classifier could be generalized into a general object classifier that relies on multiple perceptual faculties. The other tradeoff is that FUBL seems to be shouldering a lot of the burden here by parsing natural language into lambda calculus 'lexemes'. It appears as though a large chunk of the semantic parsing is being handled by FUBL, and not by the mechanism the authors describe as a primary constituent of what the robot must learn (Section 1).
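    A rough sketch (mine, not the paper's format; the field names are hypothetical) of what one training "scene" datum from point (2) might look like, just to make the (x, z, w, O, G) structure concrete:

        # One training "scene": sentence x, segmented objects O with raw percepts,
        # and the referred subset G. Only the small bootstrapping set additionally
        # carries an annotated logical form z and classifier outputs w.
        scene = {
            "sentence": "these are the yellow blocks",                 # x
            "objects": [                                               # O
                {"id": 0, "rgb": (0.91, 0.85, 0.12), "shape_features": [0.3, 0.7]},
                {"id": 1, "rgb": (0.10, 0.11, 0.72), "shape_features": [0.8, 0.1]},
            ],
            "referred": {0},                                           # G
            # "logical_form": "lambda x. color(x, yellow) and shape(x, block)",  # z (bootstrap only)
            # "classifier_outputs": {"clf_yellow": {0: True, 1: False}},         # w (bootstrap only)
        }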

  2. In this new article, the authors assume that if the robot can know the characteristics of objects and ground words' meanings by mapping objects to a perceptual system that can identify specific physical objects, then the robot can understand a human sentence. Based on this assumption, they join a "semantic parsing" model and "visual attribute classifiers" into a joint model that helps the robot understand sentences. They take two steps to join language and vision together. First, between logical content and classifiers, they create an "explicit" model. Then, using the "execution" model, they can determine which scene objects should be chosen given a specific logical sentence and the classifiers. For example, say you want a robot to choose a red apple for you. "Red" and "apple" are classifiers; the logical content of "red" is color and the logical content of "apple" is food. The robot then needs to choose an object for which both classifiers return true. This article takes a somewhat different approach from the previous articles. G^3 uses the relation between a parse tree and grounding meanings to compute probabilities that help the robot understand word meanings. Following the compositional and hierarchical structure of a natural language command, G^3 can define the grounding graph model dynamically. In Chen & Mooney’s article, they created a novel system that lets a robot learn semantic parsers only by observing human actions, without using any linguistic knowledge; the system takes observed actions and uses them to infer a formal navigation plan for each instruction. The new ideas I learned from this article are "semantic parsing" and how to use semantic parsing and grounding to account for entirely novel input. The things that need to be given to the system are: visual attribute classification, and semantic parsing (language sentences, logical forms). The system should add new classifiers and update the classifier set. This article combines language with vision, but unlike typical grounding learning, it focuses more on semantic analysis than on single words' meanings, and it is not like the general G^3. This article creates a model that can account for entirely novel input. But after reading it, I am not sure whether this model could help a robot understand sentences with prepositions (because almost all the examples the authors show are sentences with nouns and adjectives), or how a classifier would be assigned to a preposition.
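    To make the red-apple example above concrete, here is a minimal sketch (mine, with made-up classifier functions and thresholds, not the paper's code) of selecting the objects for which every relevant classifier returns true:

        # Hypothetical hard attribute classifiers: each maps an object's features to True/False.
        def is_red(obj):
            r, g, b = obj["rgb"]
            return r > 0.6 and g < 0.4 and b < 0.4          # crude threshold, illustration only

        def is_apple(obj):
            return obj["shape_label"] == "apple"            # stand-in for a learned shape classifier

        def select(objects, classifiers):
            """Return the objects for which every classifier in the conjunction is true."""
            return [o for o in objects if all(c(o) for c in classifiers)]

        scene = [
            {"id": 0, "rgb": (0.9, 0.2, 0.1), "shape_label": "apple"},
            {"id": 1, "rgb": (0.3, 0.8, 0.2), "shape_label": "apple"},
        ]
        print(select(scene, [is_red, is_apple]))            # only the red apple (id 0) survives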

  3. If I understand this paper correctly, the system they've developed uses a lambda-calculus-like classifier model to encode the "meaning" of a word as a function that is true or false when applied to a real-world object in a specific scene. The approach tackles the grounding problem by initializing a set of linguistic classifiers to a normal distribution, and then incrementally improving them over the training data until (hopefully) they have become bound to specific groups of objects in the real world, which the attribute classifiers then represent. This grounding of classifier to group is what is being "learned" by the system. The system is given a set of training data from Amazon Mechanical Turk, in which people have described groups of objects in a scene. These scenes, along with the descriptive sentences, are then analyzed in order to perform the learning task described above. This system is similar to the approach of Tellex et al. in that the main idea is to ground linguistic utterances to real-world objects, but differs in that it assumes nothing about the potential nature of objects in the world (e.g., the Tellex paper classifies four separate types of grounding -- objects, paths, places, etc. -- while this one simply understands objects as being composed of a set of attribute values). In addition, this paper does not approach the idea of commands or action execution. I'm also worried about this sort of approach in its fixation on the lambda-calculus style understanding of lexical content. As we read in (I can't remember off the top of my head... the linguistics paper with that one very complex definition of "the"), this structure begins to become unhinged as more complicated language (or even simple language that is not immediately related to the properties of physical objects) is used.

  4. The University of Washington’s system learns how attributes apply to perceived objects by pairing sentences, selected objects, and entire scenes. To do this, it makes use of FUBL semantic parsing, which combines natural language sentences with lambda-calculus-defined restrictions to learn which natural language pieces specify each portion of the lambda calculus block. These parses allow the system to learn new terms and, paired with scenes, the classification of new things to which those terms apply, such as what is “yellow” or a “block”.
    This system shares the learning of the semantic parser with Chen’s system, but goes a step further and extends this learning by focusing on attributes and what they mean. While Chen’s system focused on (sentence, action, world state) learning, allowing the system to learn how a sentence should change the current state, UW’s system focused on learning how each sentence characterized an image, and used this characterization to build classifiers for images. Chen’s system had no need for this, as it was a virtual system, so it seems to me that UW’s system is partly an extension of what Chen’s system did in virtual reality to the real world.
    UW’s system more closely resembles G3 in its interaction with, and identification of, how sentences pick out objects and attributes in the real world. Unlike G3, it avoids predefined language constructs, allowing the learned semantic parser to handle this.
    The UW system does make several tradeoffs. While it currently only recognizes shape and color attributes, it seems like it could be extended to encompass other parameters, making it appear fairly scalable. This type of learning seems incredibly helpful for object identification and description, but doesn’t seem to provide much insight into performing actions based on natural language commands. Although, as we discussed in class earlier, you can represent commands in the same lambda calculus form as attributes, so it could be more reasonable than I think it is.
    Another significant tradeoff is created by the ability to recognize new words and meanings. This ability means that words must be aligned with classifiers, which results in a fairly large computational requirement (see “Aligning Words to Classifiers” in Section 6). To be fair, this is simply a requirement of learning new words automatically, but it is not a cost faced by systems that do not allow new word definitions to arise through usage (see SHRDLU, where new definitions are created by specifying something in terms of previous definitions).

  5. Matuszek’s paper uses visual perception to ground word meanings in the physical world in order to learn representations of the meanings of natural language. Learning is performed by optimizing the data log-likelihood using an online, EM-like training algorithm. This system is able to learn accurate language and attribute models for the object set selection task, given data containing only language, raw percepts, and the target objects. By jointly learning language and perception models, the approach can identify which novel words are color attributes, which are shape attributes, and which are not attributes at all.
    What the robot must learn includes the visual classifiers, the meanings of the individual words that invoke those classifiers, and a model that helps analyze sentences. For example, given the two words ‘yellow’ and ‘blocks’, it must extract the attributes and map them to a perceptual system. The approach is based on visual attribute classification and probabilistic categorial grammar induction. Given a set of scenes containing only sentences, images, and indications of what objects are being referred to, it introduces new concepts. For objects unknown to the robot, it can recognize them by referring to the relevant objects in the training scenes.
    The joint learning approach can effectively extend the set of grounded concepts in an incomplete model initialized with supervised training on a small dataset. Given a sentence and segmented scene objects, it learns a distribution over the selected set. The approach combines recent models of language and vision, including: a semantic parsing model that defines a distribution over logical meaning representations for each sentence, and a set of visual attribute classifiers, where each classifier defines the probability of that classifier returning true for each possible object in the scene.
    Given the classifiers, the execution model determines which scene objects would be selected. Because it combines the semantic parser and classifier confidences with a deterministic ground-truth constraint, the execution model carries uncertainty. Each example given to the system contains a sentence, the objects, and the selected set. The system can automatically extend the model to induce new classifiers that are tied to new words in the semantic parser.
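    Schematically (this is my reading of the setup described above, not the paper's exact notation), the latent logical form z and classifier outputs w are marginalized out, and training increases the data log-likelihood

        L(\Theta) = \sum_i \log p(G_i \mid x_i, O_i; \Theta)
                  = \sum_i \log \sum_{z, w} p(z \mid x_i; \Theta_{sem}) \, p(w \mid O_i; \Theta_{vis}) \, p(G_i \mid z, w),

    one scene at a time, alternating between computing expectations over (z, w) and updating the semantic and visual parameters -- the "online, EM-like" part.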

  6. The system in Matuszek et al. uses established algorithms for semantic parsing (from Zettlemoyer & Collins and Kwiatkowski et al.) and off-the-shelf algorithms for learning visual classification of simple objects on a table top. The novel learning is the process by which they associate these inputs and train them together. The key to how it models the meanings of actual objects is the alignment of logical forms and classifiers in order to provide a grounding for semantic forms. For example, the meaning of green is understood by taking the constant “green” out of a semantic expression like “λx.color(x, green)” and aligning it with a visual classifier that returns true when a given object in the viewport is green.
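    A toy sketch of that alignment idea (mine, not the paper's algorithm; the regex and scoring rule are simplistic stand-ins): pull constants such as "green" out of a logical form and keep a running score of how well each classifier's positive responses explain the referred objects.

        import re
        from collections import defaultdict

        def constants_of(logical_form):
            # e.g. "lambda x. color(x, green) and shape(x, block)" -> ["green", "block"]
            return re.findall(r"\w+\(x,\s*(\w+)\)", logical_form)

        align_score = defaultdict(float)        # (constant, classifier name) -> running score

        def update_alignment(logical_form, classifier_hits, referred):
            """classifier_hits: {classifier name: set of object ids it fired on}."""
            for const in constants_of(logical_form):
                for clf, hits in classifier_hits.items():
                    # reward classifiers whose positive objects overlap the referred set
                    align_score[(const, clf)] += len(hits & referred) / max(len(referred), 1)

        update_alignment("lambda x. color(x, green) and shape(x, block)",
                         {"clf_green": {0, 2}, "clf_cube": {1}},
                         referred={0, 2})
        best = max((k for k in align_score if k[0] == "green"), key=align_score.get)
        print(best)                             # ('green', 'clf_green') after this one scene

    With a single scene, both constants get credit for the same classifier; only across many scenes do the scores start to separate, which is roughly why training the language and vision sides jointly matters.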

    The system is given labeled examples where several objects are identified in plain English and it simultaneously trains the semantic parser, the classifier, and the alignment. In order to bootstrap the process of estimating parameters it is also given fully labeled data to separately train the parser using labeled semantic outputs and the classifiers using separate labeled classification output. This is identified as a weakness that future work should seek to eliminate. Ideally such a system would be trainable only using combined examples.

    An advantage of this system is its modularity and extensibility. Since we have well-defined interfaces for classifiers and semantic parsers, these can be separately extended or modified. It requires a very small amount of fully labeled data for bootstrapping, but all the rest of the training corpus is relatively simple, which is another advantage. Presumably it could be applied in many different domains if the necessary visual classifiers are available.

    Compared with G3 and the Chen & Mooney approach, Matuszek's is noun and adjective focused. It relies on visual classifiers which are best used for shapes and colors, but more generally the focus on classifiers basically limits it to problems of identification. The other two approaches are focused on verbs and their use in commands requesting action be performed. It's unclear how the Matuszek approach could be used in order to interpret commands. The semantic parser would have some constant output for a command like “move to,” which would have to be grounded in some robotic procedure rather than a classifier. Learning how to perform actions would have to be integrated somehow and it would have to know how to differentiate between commands (verbs in a command sentence) and classifications (adjectives, nouns, verbs used descriptively).

    G3 works with more parts of speech in this sense, but in order to do so it had to make assumptions about how nouns correspond to objects and verbs correspond to actions, and it required training on how specific commands are performed. Chen and Mooney focused almost exclusively on movement commands, by finding a kind of alignment between commands and robotic actions. Perhaps this approach could be combined somehow with that of Matuszek et al. to work towards a system that can identify objects and perform actions using such identification.

  7. This approach attempts to learn both the semantic parsing of a sentence and the grounding of that sentence to objects, unlike Chen and Mooney's work, which learns only the semantic parsing, and G^3, which learns only the grounding of sentence parses to objects. The semantic parsing this system learns is much more bounded than in Chen and Mooney's work. Instead, it creates a list of high-probability mappings from the natural language sentence to a lambda calculus model. It uses these probabilities in its calculations to later choose the most likely meaning. Like the G^3 technique, it also uses statistical techniques to determine the most likely grounding of a parsed expression to grounded classifiers.

    The system is given likely parses of sentences and their associated probabilities. It's also given classifiers associated with different attributes of objects, and a limited number of supervised examples to help bootstrap the system. The coupling of sentence parses and detected attributes is learned.

    This system is heavily tied to scene descriptions, but not so much to definitions of spatial semantics. This somewhat limits its extensibility to robotic tasks. It's also not clear how the system could accommodate prepositions or spatial descriptions with a constantly changing scene. Would this system completely fall apart on SHRDLU-like tasks?

  8. The system described in this paper is perhaps unique in its use of classifiers. The typical grounding task is broken apart into two separate parts, semantic parsing and visual classification. This decomposition is almost a complete decoupling and allows each subsystem to be treated separately (to a limited extent). I believe the best way to compare this system to either Tellex et al. or Chen and Mooney is to look at the semantic parsing and visual classification separately.

    The semantic parsing system takes an approach similar to that found in Chen and Mooney, in that it represents a sentence using lambda calculus. This is an important high-level similarity, because it shows that despite implementation details, both papers treat language using similar formalisms. Coming back to the methodology used to analyze semantics, we see that an algorithm (FUBL) is used to learn a Combinatory Categorial Grammar. This allows the parse model to be initialized, using training data obtained from workers on Amazon Mechanical Turk. After this point, this paper differs substantially from Chen and Mooney, because it begins to tie the result of the semantic parsing in with visual classification. It is also interesting to note that while there are some similarities with Tellex et al., those similarities are limited to the fact that both analyze input sentences as parse trees.

    For the visual classification task, several classifiers are trained in a supervised manner, again on data tagged by workers on Amazon Mechanical Turk. The features used for classification are selected in such a way that new, never-before-seen objects should be fairly recognizable by the system. Because neither Chen and Mooney nor Tellex et al. deal directly with image classification, this particular part diverges from the readings we have previously seen.

    Perhaps the truly novel part of this paper is its combination of the information gathered in the previous two steps. Generally speaking, the problem at hand is to “automatically map a natural language sentence x and a set of scene objects O to the subset G of objects described by x.” The result of running the semantic parsing is a distribution over logical meanings for the sentence x. Similarly, the image classification part returns a distribution for each individual classifier over all available objects. Next, the model allows the selection of all objects in O described by the logical parse of the sentence, given the results of the image classifiers. In this respect, the paper is similar to Tellex et al.: both models attempt to find the set of objects that maximize the probability of a particular grounding for a sentence parse.
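    A toy numerical version of that combination step (my own made-up probabilities and a brute-force search over subsets, not the paper's model): weight each candidate logical form by its parse probability, score each candidate subset by how well the classifiers support it, and keep the subset G with the highest combined score.

        from itertools import chain, combinations
        from math import prod

        parse_dist = {("yellow", "block"): 0.7, ("yellow",): 0.3}   # p(logical form | sentence x)
        clf_prob = {                                                # p(classifier true | object), for O = {a, b, c}
            "yellow": {"a": 0.9, "b": 0.2, "c": 0.85},
            "block":  {"a": 0.8, "b": 0.9, "c": 0.1},
        }
        objects = ["a", "b", "c"]

        def subset_score(G, needed):
            """Probability that exactly the objects in G satisfy the conjunction `needed`."""
            score = 1.0
            for o in objects:
                p_in = prod(clf_prob[c][o] for c in needed)
                score *= p_in if o in G else (1.0 - p_in)
            return score

        def best_selection():
            subsets = chain.from_iterable(combinations(objects, r) for r in range(len(objects) + 1))
            scored = {G: sum(p * subset_score(G, needed) for needed, p in parse_dist.items())
                      for G in map(frozenset, subsets)}
            return max(scored, key=scored.get)

        print(sorted(best_selection()))         # object 'a' is the most probable selection here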

  9. This paper proposes a joint model of language and perception to learn grounded attributes. It uses a semantic parsing model as the language model and a set of visual attribute classifiers as the perception model. So given a natural language sentence and a set of objects, this joint model can automatically map the meaning of the sentence to the desired grounded attributes. After initializing the model with a small supervised dataset (annotated logical forms and vision classifier outputs), it can infer fairly accurate mappings from words in the sentence to grounded attributes, even on previously unseen visual scenes and with unknown words in the sentence.
    The semantic parsing language model this paper incorporates is based on the FUBL algorithm, which learns a probability distribution over parses that yield a logical form for a given sentence. The perception model is a set of classifiers using depth values and RGB values, corresponding to shape and color attributes respectively (a toy stand-in for one such classifier is sketched at the end of this comment). So the inputs to the model are annotated logical forms of sentence meanings, labeled classifier outputs, and the referenced objects of the scene for bootstrapping the model, plus unseen scenes and unknown words for training and testing.
    So the system in this paper mainly focuses on learning new representations of natural language meanings as grounded attributes using visual features, while Tellex's system uses the G^3 model to learn the relationships between parts of sentences and groundings. Tellex's system generates a grounding graph according to the grammar tree produced by the Stanford parser. So the semantic parse is essentially fixed in Tellex's system, whereas this paper's system treats it as uncertain. In Chen's approach, however, no prior linguistic knowledge (syntactic, semantic, or lexical) is used; their system only uses the observation data to learn the model.
    The tradeoff made in this system is that it limits the grounded attributes to only visual features: shape and color. I'm not sure whether this system can be extended to more features without hurting its performance. And this system is less general than Tellex's, because it does not ground word meanings to places or events the way Tellex's system does.
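    Here is the toy stand-in mentioned above for one of the per-attribute classifiers: a logistic regression over mean RGB for "yellow". The features, learner, and numbers are my own simplification, not necessarily what the paper uses.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        X = np.array([[0.92, 0.88, 0.10],    # yellow-ish objects
                      [0.95, 0.80, 0.20],
                      [0.15, 0.20, 0.80],    # blue-ish object
                      [0.70, 0.10, 0.10]])   # red-ish object
        y = np.array([1, 1, 0, 0])           # 1 = the word "yellow" applies

        clf_yellow = LogisticRegression().fit(X, y)
        new_obj = np.array([[0.90, 0.85, 0.15]])
        print(clf_yellow.predict_proba(new_obj)[0, 1])   # p("yellow" is true of this new object)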

  10. Matuszek et al.’s joint model is composed of two big systems: the language system and the vision system. The language system maps words to their corresponding objects, which are perceived by the vision system. In order to map the language input to objects, the language system transforms natural language input sentences into lambda calculus forms. The language system can validate whether an object it wants to find is detected by the vision system. The vision system consists of many object classifiers that indicate whether an object is present in an environment. The vision system can compare what it has classified to what the language system has found in the input sentences. The two systems cooperate to understand the environment.

    Matuszek et al.’s model is different from Chen and Mooney's in that Matuszek et al. incorporate computer vision information together with language information to understand the environment. Chen and Mooney’s system completely lacks the ability to perceive the environment visually. Matuszek et al. go further than Tellex et al. in that their vision system can recognize new objects that are not predefined. The vision system, jointly modeled with the language system, can handle unseen commands and objects. Their system can generalize more broadly than Chen and Mooney’s and Tellex et al.’s.

    Matuszek et al.’s system learns a joint model from the training data given a semantic parser and a few object classifiers. While doing so, it expands the set of words the semantic parser can understand and constructs new classifiers that can detect objects the input classifiers cannot recognize. Their model attempts to learn in less limited environments compared to the environments in which Chen and Mooney and Tellex et al. attempt to learn. However, for Matuszek et al.’s approach to work well, the parser and the classifiers they adopt need to be very good. Given that semantic parsers cannot really deal with “ill-formed” sentences and computer vision classifiers are not very good, it could be quite hard to apply their model to other data.

  11. Matuszek et al. want to ground a word's meaning to its representation in the perceptual model. The goal is to identify the set of objects in the world that satisfy a natural language constraint given to the system; e.g., "Here are the yellow ones" should select the set of yellow objects from a given scene. For this, three models are needed: a visual classifier that outputs colour and shape information, a semantic parsing model that in this case learns a factored lexicon, and a joint model that pairs the logical constants learned in the factored model with the attributes output by the visual classifier. The contribution of this paper is in figuring out a joint distribution that, when maximized, leads to pairs of visual attributes (color/shape) and logical constants ("red"/"triangle"). Hence adjective word meanings are represented by their attributes as calculated by a visual attribute classifier.
    It learns the pairing of a word with its visual attribute using a joint probability distribution. The parser is given to the system, as is the output of the visual attribute system, and also (I think) the basic difference between shape and colour via the bootstrap. Chen and Mooney's system learned a parser and output a plan that could be read by their "robot"; it was given this basic representation/language to connect to the robot. Kollar et al. learn groundings between objects/trajectories and words but assume a given parser. Matuszek et al.'s method is much closer to the method in Kollar et al., because they also use a given parser and learn word associations to attributes instead of to objects/trajectories. This method seems more complete, as it learns attributes instead of grounding objects themselves.
    The trade-off here seems to be a relatively simple scenario: we have only a few visual attributes and ways to describe them. The system could learn each object type as an attribute, pair these new "attributes" with their respective (noun) words, and expand, but inference would then get more difficult, as with the Kollar et al. method. This method does not learn actions, but it would be interesting to see whether we could pair an action word with the output of an action classifier.

  12. The Matuszek et al. model represents meaning as the set of objects described by the language, with regard to both its semantic and its visual representation. In other words, it is the set of objects that satisfy the truth conditions of the semantic lambda-calculus representation and possess the attributes picked out by the visual classifiers. It jointly learns visual classification (grounding) and semantic representation. The model is given ‘scenes,’ which consist of sentences, images, and a demonstration of the reference between the language and objects. An interesting part of the paper is the comparison between the joint system and the vision and semantic parser components individually. Of course (because otherwise the paper would not be interesting), the joint system performed significantly better than either of them on their own. This joint learning is in contrast to the papers from last week. Although both the Chen and Mooney and the Tellex et al. models used both a semantic parser and an executor model, each only learned one of them: for the Chen and Mooney approach the semantic parser was learned but the executor was given, and for Tellex it was the reverse. In addition, their model assumes the learner starts with a small vocabulary and set of attributes, and then automatically extends itself to include new classifiers. This is a tradeoff in that it requires more information up front, but is positive in that by design the model is meant to grow to a larger domain than it starts out on, potentially eliminating the problem of scaling. While the study described in the paper is limited to a small domain of block attribute description, the authors end by saying they believe they will be able to scale the model well to incorporate a larger domain. I would be interested in seeing whether that is true, and if so, whether it is a model that can be used to solve the scaling problem in other systems as well. Furthermore, while jointly learning language and perception improved results on the attribute learning task, the introduction of more learned parameters increases the room for error in estimation: the more variables estimated from the same data, the lower the confidence.

  13. This comment has been removed by the author.

  14. The three approaches are similar in that when these systems learn, a visual scene is given and a human provides text to describe the scene. The end goal is also similar: given text from a human (a natural language paragraph or sentence), the systems’ goal is to infer a world state (a set of objects chosen or a state of objects) or an action (a series of actions) to perform.
    In approaching this idea, Chen et al. and Matuszek et al. both take the approach of learning a semantic parser to infer either formalized actions or lambda-calculus interpretations of semantics. The two approaches differ in how they bring percepts into learning natural language. Chen et al.'s approach does not use probabilistic perception models to infer new word meanings, whereas Matuszek et al. use visual perception models with probabilities; Chen instead uses a lexicon learning algorithm to learn word meanings from the training corpus. Matuszek et al.'s approach uses strictly supervised training data to initialize the basic probability distributions for the given environment. Its strength is in expanding its world knowledge or language understanding of a given scene starting from a basic set of training data. To interpret the scene with toys of different colors and shapes, it needs at least 150 annotated data items (x, z, w, O, G); Chen's and Tellex's approaches do not need this. Compared to Chen's approach, Matuszek's has the strength that the perception model and the language model reinforce each other to learn new words. Moreover, this learning procedure can be done online (learning incrementally as each new word comes in), whereas Chen's has to be done as a batch. In Matuszek's tests, she showed that the combined approach yields better results in understanding new natural language interpretations.
    Tellex’s approach is different: instead of training a parser to output a formal interpretation, it uses parsed grammar structures to learn object relations or action relations directly. It creates a grounding graph from the parse tree of a natural language phrase and links this to perception models. In Tellex’s approach, a grounding graph is built for each language interpretation, which can mean one thing or one action. In Chen and Matuszek, various natural language interpretations can be grounded into one formalized action or lambda interpretation. However, since it is not grounded into one predefined meaning, one grounding graph can be extended to mean many other things. So it opens the possibility that one language interpretation can mean different possible world states or actions.
    Matuszek’s approach only dealt with object classification examples. Different learning algorithms and perception models would need to be presented to tackle the navigation direction problems dealt with in Tellex et al.'s and Chen et al.’s papers.
    In Tellex’s approach, if it wants to generalize navigation directions to different environments (to make the system work in different scenes) with learnt data, a different perception model needs to be adapted to learn action verbs. ‘Go to A’ may mean a specific trajectory, but it can also mean ‘going to A with possible trajectories in different environments’.

    Replies
    1. This comment has been removed by the author.

    2. This comment has been removed by the author.

    3. Matuszek et al. dealt with route-following examples in their later work. Her approach there is very similar to Chen & Mooney's; the difference is that she uses lambda expressions as the formal language, and these can actually be extended further than Chen & Mooney's navigation plans.

  15. The biggest difference between Matuszek et al and the others is that the latter two output plans, while Matuszek et al produce labels of segmented objects. These are pretty different tasks, and it's reflected in the kind of linguistic input that the system can handle. The authors note that inference with arbitrary logical expressions (meanings) in their system would be in #P -- their inputs are restricted to conjunctions of unary attribute descriptors. I'm not sure how it might be extensible to a richer variety of commands. I'm not sure that you would need a very rich language to capture visual classifiers either, which are descriptions rather than statements of arbitrary complexity.

    The system also learns a different kind of model. Their model takes visual classifiers and a semantic parser more or less off the shelf: the magic is in their combination. Their system learns correspondences between logical constants (terms in the lambda calculus) and classifier responses. I see a little bit of inflexibility here if the relationship were not one-to-one (mapping yellow to a single classifier, for example) -- and I don't think it would be in any reasonably complex domain.

    It also has slightly different inputs. In some sense, the visual classifiers are inputs, though those are (if robust) theoretically constants. These and the semantic parser also get bootstrapped with a small supervised dataset. At that point the system learns "online" with all components simultaneously from a set of examples: pairs of sentences and the indicated objects. On test inputs it takes just the sentence.

    Tellex et al and Chen & Mooney also have learned subcomponents (like the English parser in Tellex), but the input for learning in both is correspondences between natural language and actions (represented in various ways). Of course, they too take merely natural language sentences as test inputs.

    Matuszek et al make a clear tradeoff in the expressiveness of their linguistic input for what is probably a more thorough framework. The inference in Matuszek is on the full joint distribution, conditional on the (interchangeable) visual classifiers and semantic parsers. Chen & Mooney, and I think Tellex et al, have a less thorough inference step, but more general input.

  16. Matuszek's approach is to combine a semantic parsing model and a set of visual attribute classifiers to map a natural language sentence and a set of objects in the scene to a subset of all objects in the scene. If all the relevant functions (colors, shapes) return true, the robot knows which objects we are talking about. In other words, the semantic model and the visual classifiers together help the robot handle the object set selection task. We input language, raw percepts, and target objects into the system; the data set contains a sentence, objects, and a selected set. The robot can learn object characteristics, such as color and shape, via an online EM-like model. The method uses the FUBL model for semantic parsing, which extends UBL with a factored lexicon and a log-linear model that produces the probability of a parse.
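    A minimal sketch of the log-linear scoring idea mentioned above (schematic, not FUBL itself; the features and weights are invented): the probability of a parse y for a sentence x is proportional to exp(theta · f(x, y)), normalized over the candidate parses.

        import math

        def parse_probs(candidates, theta):
            """candidates: {parse: feature dict}; theta: feature weights."""
            scores = {y: math.exp(sum(theta.get(f, 0.0) * v for f, v in feats.items()))
                      for y, feats in candidates.items()}
            Z = sum(scores.values())
            return {y: s / Z for y, s in scores.items()}

        candidates = {
            "lambda x. color(x, yellow) and shape(x, block)": {"lex:yellow->color": 1, "lex:block->shape": 1},
            "lambda x. color(x, yellow)":                     {"lex:yellow->color": 1, "skip:block": 1},
        }
        theta = {"lex:yellow->color": 2.0, "lex:block->shape": 1.5, "skip:block": -0.5}
        print(parse_probs(candidates, theta))   # the fuller parse gets most of the probability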

    The three methods are similar in the sense that they parse natural language instructions and calculate probabilities to decide what objects to take. The language features they take in are different. Matuszek's approach learns new concepts from a set of scenes with objects containing certain attributes. Learning is done online, one scene at a time. Training in this way may be too expensive for real-world objects; the example in the paper refers to a set of toys, which oversimplifies the outside world. Chen and Mooney's method can learn from a lexicon, which is less expensive to obtain, but the approach cannot make use of language structure the way G3 does.
    The problems the three methods aim at solving are also different. Matuszek's approach solves the object selection task. G3 solves how to navigate and instruct robots to do things. Mooney aims at solving the navigation task. G3 and Mooney’s method can both instruct robots to go to new places or make new movements. Unlike Matuszek's approach, which is based on known characteristics, G3 is more robust in that it is based on a probabilistic graph. G3 has a broader scope of physical concepts, such as objects, places, or events. Chen and Mooney's method focuses on the navigation task, and they learn from observed human reactions. G3 and Mooney's methods both tackle a variety of human instructions, while Matuszek's approach does not.

  17. As noted, the Matuszek approach unifies ideas from both the Stefanie and Chen papers, but also brings in a computer vision component that neither of them includes. In doing so, it begins to tackle the problem of semantic understanding and knowledge representation, and manages to get fairly impressive results in the process.


    There are a number of different components going into the Matuszek system. First is an image segmentation algorithm, taking images and returning sets of possible objects in the scene. At the parsing step, the FUBL algorithm takes a dataset of sentence-to-lambda-representation mappings in order to learn a CCG. Finally, for the actual understanding task, they take a dataset of (sentence, scene objects, objects described in the sentence) triples. With this dataset, they are able to very cleverly learn the meanings of new words by seeing which objects are selected by words unknown to the lexicon. This is a major distinction from both the Chen and Tellex papers: neither previous paper has a method for learning new word meanings. On the other hand, I'd argue their task is fairly limited as well. The only verb is 'select the corresponding objects in this scene'. There is no exploration of object trajectories as in the Tellex paper, or learning of action sequences as in the Chen paper.

    Indeed, despite using a combination of the other two approaches, their primary contribution seems to be in a different part of the perception-action chain and is therefore complementary, rather than successive, to the other approaches. That is to say, the Tellex paper's primary contribution is in the inference stage, mapping words to objects, while the Matuszek paper's main contribution is using perceptual features to learn new words (indeed, in a rather clever way).

    I'm interested to think about ways the Matuszek approach could be expanded to other tasks. For one, it's not immediately obvious how to generalize their learning of nouns and adjectives to verbs; learning verbal features from video seems like a poor approach. On the other hand, tracking visual changes of the effects of a verb would be more tractable (or at least fall in the same sort of framework), and perhaps allow the model to learn (some) verbs. As a single example, 'Remove the red blocks' could be understood by first grounding the red blocks, then learning what happens to them after the execution of the verb 'remove' (that is, they disappear). I get the intuitive sense that this machinery is very limited, and more complicated techniques (perhaps some inspiration from inverse reinforcement learning?) are necessary for verbs.
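    As a toy version of that last idea (entirely speculative on my part, not something from the paper): infer a verb's "effect" by diffing which grounded objects are present before and after the command is executed.

        def verb_effect(grounded, scene_before, scene_after):
            """grounded: ids the noun phrase grounded to before execution."""
            assert grounded <= scene_before
            if grounded.isdisjoint(scene_after):
                return "removes its arguments from the scene"
            if grounded <= scene_after:
                return "leaves its arguments in the scene (perhaps moves them)"
            return "has a mixed or unclear effect"

        # "Remove the red blocks": the red blocks grounded to objects {2, 5}
        print(verb_effect({2, 5}, scene_before={1, 2, 3, 5}, scene_after={1, 3}))
        # -> "removes its arguments from the scene"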
