On Thursday, October 3rd, we will turn back to probabilistic grounding models, looking at one that takes a reinforcement learning approach.
- S. R. K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. Reinforcement learning for mapping instructions to actions. In Proceedings of the Association for Computational Linguistics (ACL), pages 82–90, 2009.
In this paper, all words are assumed to describe actions. Some words may also describe objects, but only in the context of directing the use of those objects in a certain action. The goal of this approach is to map each sequence of words to a sequence of actions. Words are understood only in the context of the command they describe and the objects (parameters) they reference, with the whole of this triple comprising a formalized "action." This is somewhat similar to the Chen/Mooney approach, in which words are understood as corresponding to a plan, i.e., a sequence of actions and verification steps. Also similar between this paper and Chen/Mooney is the use of the success or failure of previous assignments of meaning to actions to guide future understanding. In its word representation, this paper's approach is somewhat further from Tellex et al.'s Generalized Grounding Graph (G3) model. In the G3 approach, the intent is still to parse language into a robot-understandable action format, but words may be "grounded" by locations, objects, paths, or events, rather than just actions and objects. The correlation in the G3 approach is also more fine-grained: almost every word or nuclear group of words in the sentence is understood to have a specific grounding.
The approach presented in this week's paper is perhaps furthest from Matuszek et al.'s lambda-calculus-like word understanding. The Matuszek approach interprets a word as a function on objects which evaluates to true on objects that satisfy the descriptor implied by the word, directly mapping words to real-world objects. The focus there seems to be mostly on classifying groups of objects as belonging to specific linguistic categories, and it makes no mention of a formalized action space. It is somewhat similar in that very little a priori knowledge is assumed in either case.
In the Branavan paper, word meanings are learned using a policy-updating algorithm whose reward function is based on whether it is possible, at a given point, to interpret the group of words currently being parsed as a valid action for the current world state. Non-immediate rewards are also taken into account, such as a positive reward for an entire series of actions leading to a specific desired outcome. In contrast, the Chen/Mooney paper uses a graph-intersection-based method to learn a lexicon: navigation plans parsed from meanings are compared for commonality with other groundings of the same words used elsewhere. Common elements are kept, and the algorithm is iterated until the meanings converge (remain unchanged by iteration). Branavan's RL tactics differ fairly strongly from Chen/Mooney.
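As a rough illustration of the kind of reward signal described here, a minimal sketch in Python (the `env.is_executable` helper and the reward magnitudes are my own placeholders, not the paper's exact definition):

```python
def shaped_reward(history, env, goal_reached):
    """Illustrative reward in the spirit described above: small credit whenever
    the currently parsed words could be interpreted as a valid action in the
    current state, plus a delayed bonus if the whole sequence of actions
    reaches the desired outcome.  `env.is_executable` is hypothetical."""
    reward = 0.0
    for state, action in history:          # history holds (state, action) pairs
        reward += 0.1 if env.is_executable(action, state) else -0.1
    if goal_reached:
        reward += 1.0                      # non-immediate reward for the full series
    return reward
```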
In the Tellex et al. G3 approach, word meanings are learned probabilistically based on prior knowledge of the state/world space. This is more similar to Branavan than to Chen/Mooney, but differs in that it is purely probabilistic rather than RL-based. The Matuszek paper uses math that is quite honestly beyond me, but it seems as though their approach is similar to G3 in that it attempts to create a probability distribution over possible groundings and then uses that to ground new examples, thereby updating the distribution. In any case, Matuszek is again the most different in its approach (mostly because of the differing goals: object classification vs. action planning).
All of the approaches attempt, to some degree, to reduce the need for training data. The Branavan approach begins with no assumed a priori knowledge, and the system is able to perform decently. It is then augmented with annotated actions, which bolsters performance significantly.
In Branavan et al. there isn't exactly a direct representation of word meaning like we saw in some of the other papers. In G^3 we could easily point to the lambda values which represent confidence in some word being grounded in some real-world object or action. This RL approach gets away without a direct representation of meaning by skipping straight from words to actions. The goal is to learn a mapping from a series of sentences (themselves lists of words) to a series of actions. We might say that meaning is represented in the parameters that define this mapping, but the meaning of a particular word doesn't exist in a concrete, discretized fashion in this paper.
The goals are reminiscent of Chen and Mooney and Tellex et al. because they see language as having the goal of causing particular actions to take place in the world. It differentiates itself from Chen and Mooney and Tellex et al. in that it doesn't actually attempt to impose any structure on language. When trying to map sets of words to actions, it simply attempts to align substrings. Chen and Mooney and Matuszek, on the other hand, use a semantic parse, and Tellex et al. parses and constructs a grounding graph. This naturally leads to the question of whether the Branavan et al. method could be improved if it had a more structured understanding of language than a flat list of words.
The parameters for aligning word sequences with actions are learned through reinforcement learning, which is naturally suited to this problem because the authors chose domains with clear reward structures. In the tech support example, they simply feed it 18 help documents from the Windows help system, on which it performs gradient ascent to find locally optimal parameters. After learning from this set, it was able to handle various other test documents from the Windows help center.
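A rough sketch of that training loop might look like the following (the callables `rollout`, `reward_fn`, and `update_step` are placeholders for environment interaction, reward computation, and the parameter update, not the paper's actual interfaces):

```python
def train(theta, documents, rollout, reward_fn, update_step, epochs=100):
    """Sketch of RL training over a set of instruction documents: repeatedly
    construct an action sequence for each document under the current policy,
    observe the resulting reward, and update the policy parameters."""
    for _ in range(epochs):
        for doc in documents:
            history = rollout(theta, doc)      # sample an action history
            reward = reward_fn(history, doc)   # e.g. environment feedback
            theta = update_step(theta, history, reward)
    return theta
```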
The main strength of the paper is the lack of labeled training data required. Even though it doesn't require a corpus of labeled data, it still requires a carefully designed reward function. This is mainly possible because it could be trained using reinforcement learning. In a space where goals and rewards are less clear, we might not be able to use this methodology. Also, since it was a virtual domain, performing a great number of RL iterations isn't unreasonable, but doing so in a real-life physical domain might not be feasible if we don't have time to watch a robot struggle through many, many iterations in order to learn parameters. That said, we often need robots to perform particular actions for us that have clear goals, and this method seems well suited for doing that without an onerous amount of labeling intervention.
For clarity, I refer to papers with the last name of the first author only.
Out of all of the papers, Matuszek is perhaps the closest to actually assigning semantic meanings to words. Chen seemed to be doing something similar for recognition of landmarks, but otherwise is trying to connect words with actions directly. The approach detailed in Branavan, for example, would probably be unable to create language on its own because it’s based on its mapping of words to actions and not interactions of the words with each other.
I suppose a clearer distinction between the approaches is that Matuszek directly incorporates compositional semantics, which allows it to tie sensory input to predetermined digital features. This is possible because it is used in a severely limited scope. On the other hand, Branavan maps words directly to actions; the compositional order of the sentence is not as influential, so it probably cannot process relations between objects. Kollar, however, would be able to, as it uses relational features to understand relationships between objects.
Branavan’s approach was similar to using an MDP. The reward function was situational and dependent on the application case. Typically the reward function would need to be specifically pre-determined since partial completion of tasks is too difficult to judge. For example, only the end state for the technical instruction and puzzle cases is easy to verify. Matuszek relies on training classifiers for visual features after parsing compositional semantics for each input command while Kollar and Chen rely on probabilistic methods for creating features.
In terms of training data, Branavan’s novel ability to incorporate documents with and without annotations for training made it very flexible. While it was possible for Branavan’s system to use solely environmental rewards, most of the other papers required some form of annotation.
For example, Chen required representations of virtual worlds in addition to annotated instructor and follower data for training. Matuszek relies on a complete semantic parser in order to know which words need connections to features.
The papers all involve sensory data or external inputs. Branavan typically used textual or modeled input from the system in question. In the Operating System case, the end result was to read words on the screen. For the puzzle, this involved understanding the puzzle model. In these cases, external input was typically for reward function calculation.
On the other hand, Kollar, Matuszek, and Chen used sensory input for understanding commands through the assignment of features to words. For example, this occurs in both Kollar and Chen when the algorithms attempt to recognize landmarks during command interpretation. Matuszek similarly uses sensory input to assign features to words. Branavan does not do this.
In this article, the authors use reinforcement learning to map natural language to "sequences of executable actions". They use the validation process as the main source of supervision to guide learning. Given a set of documents, the machine repeatedly constructs action sequences for these documents and observes the reward. Using a policy, the machine maximizes future expected reward. The authors have the machine map each document to the sequence of actions the document expresses. An action contains a command, the command's parameters (which refer to objects available in the current environment), and the words describing them. The environment state changes according to the command and its parameters. The reward function outputs a score connected with correct action selection: when the history of states and actions visited while interpreting a document matches the annotation for the document, the reward function returns one; otherwise it returns zero. Using this, the machine can map each sentence in a document to an action. By repeatedly choosing an action given the current state and applying it to advance to a new state, the machine can predict a sequence of actions. Given the current environment state, the document, the sentence index, and the words mapped by previous actions for the same sentence, the machine predicts the next action, because the space of possible next actions is defined by the unused words in the sentence and the possible commands and parameters in the current environment. Since the natural objective is to maximize expected future reward, the authors estimate the parameters of the policy so as to maximize the value function; maximizing the value function maximizes the objective. The machine can then match words in each sentence to objects in the environment. If a word in a sentence can be matched to a current environment object, the history reward is 1; if not, that means the previous sentences were analyzed incorrectly, so the history reward is -1. (I have a question here: if a word in the sentence cannot be matched to an object in the environment, how can the machine know which previous sentence was analyzed wrong? Or does it not matter which sentence is wrong, the only thing the machine needs being the history reward?)
During training, we need to provide a set of documents, the ability to sample from the transition distribution, and a reward function. The first goal of training is to estimate the parameters of the policy p, which is the action selection distribution; the second is to obtain the optimal policy. The required data are the histories of states and actions visited while interpreting a document, along with the parameters. The other data needed are each action's command, the command's parameters, the objects available in the current environment state, and the environment state itself.
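For the annotated case described above, the reward reduces to an exact-match check against the annotation. A minimal sketch (function and argument names are mine):

```python
def annotation_reward(history, annotated_actions):
    """Return 1 if the sequence of actions executed while interpreting the
    document matches the annotated action sequence, otherwise 0."""
    executed = [action for _, action in history]   # history holds (state, action) pairs
    return 1.0 if executed == annotated_actions else 0.0
```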
Word meanings are represented as actions in a Markov Decision Process used for reinforcement learning. This is significantly different from Chen and Mooney and Tellex et al. in that Branavan et al. don't seem to learn a semantic parse of the language; they simply take words in a sentence and map them directly to actions and parameters in an environment.
Word meanings are learned through reinforcement learning, where actions are tuples of the command, the command's parameters, and the words describing the command and the parameters. States are tuples of the current environment state, the current document, the index of the sentence in the document, and the words that were mapped by previous actions. During training, words are mapped to commands, and reward is given when correct actions are taken.
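The action and state tuples described here could be written out as simple records like the following (field names and types are my own simplification; the structure follows the tuples described above):

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass(frozen=True)
class Action:
    command: str                  # the command to execute (e.g. a click or keypress)
    parameters: Tuple[str, ...]   # the command's parameters (objects acted upon)
    words: Tuple[str, ...]        # the words describing the command and parameters

@dataclass(frozen=True)
class State:
    environment: Tuple[str, ...]  # current environment state (objects and properties)
    document: Tuple[str, ...]     # the document, as a sequence of sentences
    sentence_index: int           # which sentence is currently being interpreted
    mapped_words: FrozenSet[str]  # words already consumed by previous actions
```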
A sample of documents, including Windows troubleshooting guides and puzzle game tutorials, is provided as training data. The RL learner is given the documents, which are broken into sentences. Tutorials can be simplified into a set of commands, as the entire document instructs the reader to perform a specific task.
A transition distribution over which actions are possible given an environment state (which can be sampled from) and a reward function are provided as input to the system, and are used especially during training. The environment is used to model the transition function; it is a set of interactive objects and their properties. As different objects are used, the environment state is altered, changing the specific actions available for those objects.
This system simplifies language into commands, and assumes that actions or parameters of actions are the sole object modeled by language. This limits the system's practicality to very command heavy texts. Extending the capability of this system to dialogue and interaction may not be feasible with the current language representation.
This paper introduces a reinforcement learning framework that induces mappings from text to actions without the need for annotated training examples. To guide learning, the validation process is used as the main source of supervision. Without human-created annotations, standard supervised techniques are not applicable, but this form of supervision is still able to learn interpretations of natural language instructions in this case. During training, the learner repeatedly constructs action sequences for a set of given documents, executes those actions, and observes the resulting reward. A set of documents, the ability to sample from the transition distribution, and a reward function are provided during training. All of these papers try to build systems that do not rely too heavily on training data. However, Matuszek's paper uses visual perception to ground word meanings in the physical world and learn representations of the meanings of natural language. Learning is performed by optimizing the data log-likelihood using an online, EM-like training algorithm. That system is able to learn accurate language and attribute models for the object-set selection task, given input data containing only language, raw percepts, and the target objects. By jointly learning language and perception models, the approach can identify which novel words are color attributes, shape attributes, or not attributes at all. In Tellex's paper, G3 knows word meanings and infers the type of grounding variable by mapping noun phrases to objects, prepositional phrases to places, and verb phrases to events. The G3 framework defines a probabilistic graphical model dynamically and can understand novel commands not present in the training set. Chen's system has to learn the meaning of every word; natural language instructions, observed action sequences, and a description of the current state are given to it.
In this paper, the policy is modeled in a log-linear fashion, which makes it possible to incorporate features of both the instruction text and the environment. To predict a sequence of actions, the sequence is constructed by repeatedly choosing an action given the current mapping state and applying that action to advance to a new state. A policy gradient algorithm is employed to estimate the parameters of this model; it uses stochastic gradient ascent, computing a noisy estimate of the expectation using just a subset of the histories. In this approach, models trained only with simple reward signals perform well, and augmenting the unlabeled documents with a small fraction of annotated examples reduces the performance gap further.
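A rough sketch of the log-linear policy and a single stochastic-gradient-ascent step, under the assumption of a feature function `features(state, action)` returning a NumPy vector and a history of (state, chosen action, candidate actions) triples; the paper's actual update also averages over multiple sampled histories:

```python
import numpy as np

def action_probabilities(theta, state, candidates, features):
    """Log-linear policy: p(a | s; theta) proportional to exp(theta . phi(s, a))."""
    scores = np.array([theta @ features(state, a) for a in candidates])
    scores -= scores.max()                  # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def policy_gradient_step(theta, history, reward, features, lr=0.1):
    """One noisy gradient-ascent step on a single sampled history:
        grad ~= reward * sum_t [ phi(s_t, a_t) - E_{p(a|s_t)}[phi(s_t, a)] ]"""
    grad = np.zeros_like(theta)
    for state, action, candidates in history:
        probs = action_probabilities(theta, state, candidates, features)
        expected_phi = np.sum([p * features(state, a)
                               for p, a in zip(probs, candidates)], axis=0)
        grad += features(state, action) - expected_phi
    return theta + lr * reward * grad
```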
In the paper by Branavan et al, word meanings are represented by their relationship to objects on the screen or actions to take. They have no other abstract representation, but are just used as proxies for objects and actions in the physical world. The meanings are learned using a reinforcement learning algorithm. Based on the history of actions taken and states visited, a reward is assigned that provides information about whether or not the sequence of actions resulted in a favorable outcome. Since there is no way to solve for the exact parameters that optimize actions, they use a policy gradient algorithm to estimate the thetas.
They used and compared several different types of training data. The most unsupervised method was one without any annotated data, which relied solely on reinforcement learning: using only the given reward function, the machine learned the optimal policy. Then they tried incorporating partially annotated data with the reinforcement learning. This outperformed the un-annotated version, but only slightly. Finally, they compared their algorithm to one with fully annotated data, which performed the best. However, this is a tradeoff. Clearly, if it is just as successful, not relying on annotated data is an advantage compared to the other versions and the approaches discussed last week. Creating large annotated data sets is an obstacle to expanding the robot’s functionality to other domains, so using a partially annotated corpus seems like a good compromise. The graphs in Figure 5, which show how the amount of annotated data affects the outcome, indicate that at all levels reinforcement learning improves the machine's performance. Finding a compromise number of annotated documents, minimizing the difficulty of acquiring the corpus while still gaining the benefits of annotation, is a good solution. The ability of this algorithm to work without annotation is a positive compared to the previous methods in the papers by Chen and Mooney and Tellex et al.
The other important input is the reward function. In the Windows domain, determining the goal state is difficult, so they define their reward function by checking whether any words in the next sentence correspond to objects on the screen. This is a noisy definition, since taking the wrong action could happen to result in matching objects appearing on the screen. For the game Crossblock, it is easy to recognize the goal state (since no blocks are left), and therefore all other states are assigned -1 and the goal state a positive value based on how many of the words in the command were used to get to the goal. Creating the reward function requires human intervention, which is a drawback of the method.
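Two toy versions of those reward functions, just to make the descriptions concrete (names and magnitudes are illustrative, not the paper's exact definitions):

```python
def windows_reward(next_sentence_words, visible_object_names):
    """Noisy Windows-domain reward: positive if any word of the next sentence
    matches an object currently on screen, negative otherwise."""
    return 1.0 if set(next_sentence_words) & set(visible_object_names) else -1.0

def crossblock_reward(history, goal_reached, words_used, words_in_command):
    """Crossblock-style reward: non-goal steps cost -1 each; reaching the goal
    earns a positive value scaled by how many of the command's words were used."""
    if not goal_reached:
        return -1.0 * len(history)
    return words_used / max(words_in_command, 1)
```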
In Branavan's paper, word meanings are represented by mapping text directly to actions. Sentences in the input document are divided into sequences of words. Unlike previous work that learns from parallel data and perceptual co-occurrence patterns, this approach learns by proactively interacting with an external environment. The method interprets documents as sequences of commands and their parameters. As the paper points out, Branavan's method uses a reward function that assesses the quality of executed actions in place of supervised learning techniques. What is more, it uses a policy gradient algorithm that learns from only a small subset of the histories. Combining these two methods helps the approach be both accurate and efficient.
Matuszek's method does not require parallel data and learns through an EM algorithm. Branavan's method does not require parallel data either. The learner constructs action sequences for the given documents, executes them, and then observes the resulting reward. In learning, we find the optimal policy, which is why the method is labeled "reinforcement" learning. Every time the learner observes the outcome of an action sequence, it applies the reward function to it. The goal is to maximize the value function, and a policy gradient algorithm is used to optimize this objective. The reward function assigns a positive number to a state if execution can proceed from one sentence to the next, and -1 if not.
The training data for this method include a set of documents, the ability to sample from the transition distribution, and a reward function. The set of documents provides the commands and their parameters. The ability to sample from the transition distribution helps the learner estimate the action selection distribution. The reward function helps to find the optimal policy. Other than training data, we also need the environment state, that is, the set of objects available for interaction and their properties, as input.
The method is similar to Chen and Mooney's in its training method, in that the robot observes actions and their results and uses them to train. Matuszek's method maps natural language sentences to sets of scene objects, not including verbs; this method takes in verbs and their objects as parameters. The training data for Chen and Mooney's method include natural language instructions, observed actions, and a description of the current state. Branavan's method does not map an individual word directly to an action; instead it maps a document to a sequence of actions and predicts that sequence. The robot constructs the actions dynamically, choosing an action given the current mapping and taking that action to move to a new state.
On the other hand, this method is similar to the G3 method in that both combine verb meanings with the meanings of their objects. The method uses an action a = (c, R, W): c for the command, R for the command's parameters, and W for the words specifying c and R. R refers to objects available in the environment state. The G3 method takes in data and uses a semantic parser to analyze verbs and nouns separately, which may increase the accuracy of the mapping.
Branavan’s approach focuses on binding every word (or sequence of words) to an action (although some actions may be null). Pieces of sentences are paired with actions on objects (the object on which the action occurs is also extracted from the sentence using knowledge of the current environment). Meanings are learned by trying various actions given the current sequence of words of a sentence; a reward function determines the effectiveness of the current action. After multiple iterations of this, the most rewarding actions for a given sequence of words are determined. The training data are supplied in the form of documents describing actions and a reward function over state-action pairs. The additional input comes in the form of the current environment, to allow for identification of objects in the world when actions are performed on them, as in “Click OK”. Combined, this uses RL to determine the best action to perform given a sequence of words, allowing the accomplishment of tasks specified in well-documented domains.
This model is rather similar to the UW approach in that it applies operators to objects. Applying certain actions to objects in Branavan’s method closely resembles the way in which the UW system applies descriptors to objects. The major difference here is UW’s approach had no actions (although the system seemed to be capable of extending to allow them) and Branavan’s appeared to have no descriptors.
Branavan’s approach also shares some similarities with Mooney’s in that it pairs action sequences with word sequences. The major difference is that Branavan’s approach uses rewards based on final results to learn action sequences, with the allowed actions already defined, while Mooney’s trained on action sequences paired with sentences, where the actions don’t seem to have been pre-defined.
This approach is most different from G3, sharing mainly the fact that both determine actions from natural language sentences. They also both begin with predefined actions and learn how to map natural language to those actions. Beyond that, their methods of creating action sequences seem to be fairly different.
Branavan’s approach is an interesting way of natural language processing, but it seems fairly restrictive in domain. It depends upon set actions (which, to be fair, so do many other NLP programs), as well as a reward function. It also requires knowledge of all objects it can act upon and doesn’t seem to handle descriptors. This is a tradeoff that is made for effective use of RL in this domain, and does show excellent results.
Branavan et al. map words to states and actions of an MDP. Given instructions, the MDP maps some words to states of the environment and other words to actions of the model. Training the MDP requires documents of instructions and reward functions that guide the model. Using the gradient ascent algorithm, the model maximizes the expected sum of rewards. The model can perform better if it is given the documents with action annotations.
A big difference between Branavan et al.’s model and the models of other authors is that this model can be trained on data without annotations. At minimum, it only requires natural language instructions and the goal states of those instructions. It is cheaper to train than the other models we discussed previously. Providing partial or full annotations to the model leads to improved performance. We can have a system that works with minimal training data, and we can improve it by providing some annotations.
Another difference between the MDP model and others from previous papers is that people need to choose “good” reward functions. Other models don’t require reward functions. They can be trained on the training data only. One problem with the reward functions is that it is not so clear how to provide “good” reward functions. If it is given bad reward functions, it may do something that designers don’t want it to do. It may behave incorrectly even if it is given “good” reward functions that designers carefully have chosen.
The MDP model works in a more restricted domain than other models from previous papers. If the MDP has too large a state space, then it may take too long to train the model. With a limited set of vocabularies, the model may have problems with handling unseen words that are crucial to given instructions. It may be hard to generalize this model to other domains where restrictions should be loosened.
In the Branavan et al. paper, words are scarcely represented – instead, they are mapped directly to actions and objects in the environment (e.g. a sentence containing a command with the term 'run' maps to an object in the agent's environment that contains the keyword 'run'). Thus, the system is not designed to accomplish word representation in the conventional sense. As mentioned, however, sentences are mapped to actions and objects in the agent's environment (e.g. the Windows OS) through two different approaches. Both approaches employ a classical reinforcement learning mechanism, with varying amounts of prior knowledge. The first approach includes a small set of annotations in which the correct action is defined, so that the agent may calculate reward by comparing the chosen action to the correct action. The second mechanism learns simply by using direct feedback from the environment as the reward function in the RL framework. This is accomplished by (noisily) assessing whether or not the next sentence/command may be completed based on the current state of the environment. If we accomplished all of the proper tasks in the previous steps, then the environment will be in a state that allows us to continue with our task. More broadly (for both approaches), a policy is learned by maximizing a value function (akin to the one we saw in the pair of POMDP papers). The only 'training' data required is basically the reward function. This is easily the most impressive part of the paper, in my opinion, as they are able to achieve quite a lot with a minimal foundation. However, this seems a direct consequence of the fact that they aren't really attempting language understanding in the classical sense – they are instead mapping instructions directly to actions (or associated objects), where the environment that corresponds to actions contains some linguistic information (namely, R) correlated with the actions (e.g. the button 'Run' can map to any sentence containing '{r|R}un' in it).
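A toy illustration of that kind of keyword alignment (the `label` attribute on environment objects is hypothetical, and the paper's real feature functions are richer than a plain string match):

```python
def matching_objects(sentence, environment_objects):
    """Return the environment objects whose on-screen label shares a word with
    the sentence, so an instruction containing "Run" can align with a button
    labelled "Run".  Purely illustrative of keyword-based mapping."""
    sentence_words = {w.strip('.,').lower() for w in sentence.split()}
    return [obj for obj in environment_objects
            if sentence_words & {w.lower() for w in obj.label.split()}]
```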
As mentioned above, the most significant difference between this paper and Chen/Mooney/G^3/UDub is that the semantic and syntactic content of the sentences is essentially irrelevant here – the only salient features of a given utterance are those that enable the system to map them to a command or object, which in this case, entirely excludes sentence structure. Additionally, the component being learned is the policy, which differs quite a bit from past papers. The final difference is that this system relies quite heavily on the reward function (while the previous PLG models do not), which, as we mentioned yesterday in class, is a critical component of any RL system.
Branavan's paper uses RL-based algorithms to learn mappings between NL instructions and actions. This seems close to Chen and Mooney's approach, since the space of actions is limited (as it was in Chen and Mooney's case) by the operational language, but it differs in that observations of words are represented in the transition probabilities between states. So there is grounding between actions and keywords, as in Kollar et al. and Matuszek et al.
The tricky part for me is to see what happens if the set of actions we can take becomes considerably larger. In this example it is an interface on the computer, which means there are not that many actions to perform (left/right/double clicks and typing), though the state space can be larger. This does not compare to the Matuszek paper, where the action space was only to match items or not, or the Mooney paper, where the set of actions was again limited to operational commands. In the Kollar et al. paper, since it uses supervised learning, the set of actions can grow and there is no search over previous histories to look for the action to be performed, but rather a training example that we can follow. Here each growth in the action space would complicate the problem further, so I wonder how to expand its horizons to a word representation problem with a larger action space. The question is basically whether supervised learning is a better method for solving the representation problem than reinforcement learning, as admittedly Branavan's results do improve with feedback.
Another big question is the sort of reward function to be used in this case. The rewards seem hand-shaped, and extending this to robot-based scenarios might not work: with each different scenario we might have to hand-tune the reward function, as was done for the two scenarios in the paper, so it might be easier and "fool"-proof to get supervised examples. To me the fun part of this paper was using the policy-gradient-based method to learn better policies from previous actions, applied to the setting of language learning. But I am still not sure whether it can scale to larger scenarios. I also don't think comparing RL with supervised learning is entirely fair, since the two are suited to different scenarios, but in the case of language I believe there is a large enough corpus that supervision can give better results.
In Branavan’s work, as in Chen & Mooney’s case, word meanings are directly tied to discrete commands and objects in the world. The training methods in the two approaches differ: Chen & Mooney’s work uses the Lexicon Learning algorithm to find the common meanings between n-grams and graphs structured by formal language inputs, whereas Branavan’s work uses reinforcement learning and the policy gradient method to tackle this problem. In Branavan’s work, the parameters used to compute the probability of taking an action (which means choosing words, a command, and parameters, given the current environment, document, position in the document, and the words already mapped) are learned through the above-mentioned algorithm. It iterates through the chosen action series of each document, computing the rewards and updating the parameters until it asymptotically converges.
Considering what training data are used: although the algorithm itself does not require data annotated from words to actions in its basic setting, annotation actually helped it perform better in the test results. In this approach, a world with constraints and a set of documents with instructions are given, and success in following the instructions is computed automatically. Therefore, unlike the three other approaches we dealt with earlier, the system itself is self-contained. However, when no pre-annotated documents are given, the performance is not as good as in the annotated cases; annotating 20 to 30 percent of the documents actually helped the system perform about as well as the fully annotated case. When a document is not annotated with a specific action sequence, it can be interpreted as various action sequences, and only its success can be computed automatically.
Branavan’s approach is only applicable if we have a world from which we can extract the reward of an action automatically, without human input. If there is no feedback from the given environment, full supervision is needed for better performance. An example of such feedback is when a series of actions is taken and there is no object for the next action when it requires one; a negative value is then added to the reward. Such input could also be provided in other situations, for example where robot manipulations fail to complete actions by making objects fall or break. In the other three approaches, rewards for completing a series of actions could be given, since you can measure the success or failure of a demonstration conducted by a robot or an agent. Environment feedback in this approach therefore plays an important role in training performance.
In conclusion, this approach's strong point is that it does not require a corpus fully transcribed into a formal language, as in Chen & Mooney’s case, while it is still learning formal instructions from a given text.
In Branavan et al.'s paper, the meanings of word sequences are mapped to actions that consist of commands and the commands' parameters. The learner tries to learn a policy that chooses an appropriate action given the external environment state and the natural language instruction. The policy is represented as a set of parameters indicating the relative weights of different features in a log-linear model. The policy is learned through reinforcement learning, in which the reward function is given by environment feedback (i.e., whether the task indicated by the instruction is actually finished) or, in circumstances where the environment is difficult to verify, by other reasonable measurements, e.g., whether the generated action sequence is executable. The training data required consist of a set of natural language instructions and corresponding environment settings, and, when available, annotated data used as supervision. In that case, the reward function can be combined with the supervision data (i.e., whether the action chosen by the policy matches the annotated action).
Compared to the three previous papers, Branavan et al.'s paper has the least dependence on supervision, since the reward function can be constructed in the absence of annotated data, while the other three papers require different forms of supervised training data. However, as shown by the experimental results, the use of annotated data can make a significant improvement in performance, so it's actually a trade-off. On the other hand, in the absence of annotated data, the performance of the learner would be affected by the quality of the reward function. In the Microsoft Windows help and support case, the reward function does make some sense, since if the execution cannot proceed then the task definitely cannot be completed; however, the execution proceeding does not necessarily mean the target is achieved. So I was wondering how much performance could be gained if a more meaningful reward function were designed.
I'm going to largely agree with Dave here: word meanings are only represented in Branavan's work as mappings to actions taken by the agent, with no attempt to assign some internal meaning. This is in starkest contrast to the Tellex and Matuszek papers, which learn actual features through which to understand word meaning. Furthermore, while they emphasize in the paper that their semantic structure is fairly diverse (Section 7), they nonetheless don't use this linguistic structure. Rather, their parsing is based purely on keyword association, and their dataset is extremely limited in this regard. Windows tutorials use specific button names as word objects (Run, Okay, Internet Options menu, and so on), greatly simplifying and enabling their keyword-based approach.
These associations are learned through RL techniques; they learn a policy which maximizes reward over an action history. Some of this learning occurs through supervision, and Branavan compares various levels of supervision. Because of the RL approach, learning also takes place through a reward function. This requirement in some sense takes the place of the supervision data used in other (Mooney, Chen, Matuszek) approaches. Branavan's approach requires comparatively very little data, but instead a cost function must be designed for the problem domain. This serves them well in their domain, but the approach does not scale to other environments.
Through this methodology, they manage to achieve decent results with very little training data. Most of their training occurs through a carefully selected (albeit impressively naïve) reward function. They further emphasize their result that a very small amount of training data dramatically improves their agent's performance. However, this rather highlights the keyword-based simplicity of their problem set-up; they have a simple environment in which they are able to achieve rapid convergence to the optimal policy.
I think this combination of factors points to the fact that Branavan's approach is domain-specific but fairly powerful. They achieve good success with very light data requirements. In that regard, their technique is fairly useful. For a problem that fits their requirements (keyword-based semantic parsing, a fixed action space, ease of designing a reward function), their technique is very applicable. However, because of the cost function and the lack of semantic understanding, it seems that generalizable results are better achieved by Tellex and Matuszek.
Word meanings in Branavan's paper are represented as executable actions. The authors use reinforcement learning to learn mappings between documents and sequences of actions. This is most like Chen & Mooney's approach, which interprets natural language navigation instructions as actions. The G^3 model, by contrast, maps physical representations, or groundings, to word meanings according to the compositional and hierarchical structure of language, while Matuszek's joint model of language and perception learns representations of word meanings for grounded attributes.
The goal of Branavan's approach is to find the optimal policy, which in turn means estimating the parameters of the action selection distribution, since the parameters that maximize the reward yield the best actions. In the learning process, the objective is to estimate the parameters that maximize the value function; the parameters are updated during learning using the reward function. The reward function is defined based on whether annotated data are available: if every training example is annotated, the reward function is straightforward; otherwise, the reward function can be defined by environmental feedback that correlates with the quality of the action sequence. This leads to the biggest difference in the learning process between this approach and the other three: this approach doesn't need annotated training data for supervised learning, while the other three approaches all use such data to some extent. It guides the learning process by validating the correctness of actions, and the results show that the environment-based reward function can provide strong guidance for learning.
After the parameters have been estimated during training, the most accurate action can be selected by finding the maximum-probability action among all possible actions given the current mapping state, with the parameters held fixed. The training data Branavan's approach uses are a set of natural language instructions, commands, the commands' parameters, environment states, and features that capture various aspects of the actions. In Chen & Mooney's approach, the training data are natural language instructions and the corresponding action sequences; it tries to find the underlying navigation plan mapping the action sequences to the natural language instructions. It doesn't need any prior linguistic knowledge, while the G^3 model and Matuszek's joint model both use annotated data collected from AMT plus additional input, such as parse trees generated by the Stanford Parser in the G^3 model and logical forms and vision classifiers in Matuszek's approach.
In general, without annotated data, Branavan's approach can do quite well at predicting an action sequence corresponding to the natural language instructions. But it needs environment-based feedback as a reward function to learn the parameters of the log-linear model. The accuracy of the action sequence depends on the parameters, which in turn are based on this feedback. So if the environment can't provide feedback that clearly reflects the quality of actions, this reinforcement learning approach may not work so well; reinforcement learning relies heavily on the correctness of the reward function.
In Branavan et al. word meanings aren't explicitly represented. Rather, their
associations with commands and parameters (their alignment) are inferred through
a reinforcement learning process. This is similar in some ways to Chen & Mooney,
who also map from commands to actions -- though they do go through an intermediate
semantic parse. The closest thing you could point to for a meaning would be the
association between phrases and commands (or their parameters) -- the thing
that has to happen for the words to be fulfilled.
The reinforcement learning was run with varying levels of supervision. Unsurprisingly,
the more supervised the task, the better it generally performed. As I understood
it, their system allows for significant flexibility in the "level" of supervision
by changing the reward function's knowledge of what progress is. It's not entirely
clear to me how this worked in the partial supervision case: how is the reward
function integrating the supervised examples?
The performance of a minimally supervised approach such as this is pretty
surprising. At the same time, there's a lot of "input" in the reward function
which isn't traditionally supervised, but nonetheless makes an enormous difference
in performance (indeed, when supervision is introduced I think it's through the reward
function). I'm not sure how well the naive "you got it right" approach would
work in even larger state spaces. Granted, the branching factors on both
of these tasks are relatively large (~10) -- but there aren't too many steps, and
the amount of *information* required for each action decision is much
less than log2(10) bits. This is made clear by the marginal success of the
majority baseline. Not to mention that determining whether the agent actually
did what it was supposed to can be challenging in the first place.
As Andrew mentioned, there's nothing about the reinforcement learning which
requires such a simple view of the language. It may well be that this is as
performant as anything else, but it makes me curious to see if something
more structured could do better. It might, for example, be able to leverage
the substructure somehow in learning from rewards, or simply be able to handle
a more diverse set of linguistic inputs.