This week we will read a pair of papers about POMDPs, Partially Observable Markov Decision Processes. This framework has been used for robot planning and perception, as well as spoken dialogue systems.
By 5pm on Sunday, post about 25 words answering the following question:
- What are the challenges in using a POMDP model to drive a language-using agent? How could those challenges be overcome?
By 5pm on Monday, post a reply to someone else's blog post. Suggest an alternate solution, ask a clarifying question, or point out something they might find useful.
One difficulty in using a POMDP is that the agent does not know which state it is in. Instead, it maintains a belief state, a probability distribution over the possible states. This makes reasoning about the optimal action more complicated: instead of just computing the best action for the current state, the agent has to compute values over all the states it might be in. A possible solution is to use plan graphs, which do not require any explicit representation of the belief state. As demonstrated in the paper, a plan graph can often be simplified by eliminating nodes that are never reached from the initial state. However, the size of the optimal plan graph increases with uncertainty about the world, so much more memory is needed to store the graph. One way to reduce the cost of computation is to impose a probability threshold on the possible states: instead of evaluating the policy tree for all states, disregard the states with probability lower than the threshold and only consider the more likely ones.
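A rough sketch of that thresholding idea (the cutoff value is just an assumed parameter) might look like this:

```python
import numpy as np

def prune_belief(belief, threshold=0.01):
    """Zero out states whose belief falls below `threshold`, then renormalize.

    `belief` is a 1-D array holding a probability for each world state.
    The planner would then only evaluate policy trees over the surviving states.
    """
    pruned = np.where(belief >= threshold, belief, 0.0)
    total = pruned.sum()
    if total == 0.0:  # everything fell below the cutoff; keep the original belief
        return belief
    return pruned / total
```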
Another challenge is that not all POMDPs have optimal infinite-horizon value functions that can be represented finitely, and as a result they can only be solved approximately. The tiger example from the paper does, so value iteration solves that problem. For POMDPs that are not finitely transient, a solution could be to approximate the model with a finitely transient policy, again disregarding the unlikely nodes, and then use value iteration.
I think the largest challenge raised by using a POMDP model is the computational requirement of doing value iteration over every state that the agent believes is possible. Luckily, language-using agents have a couple of interesting ways of pruning down this computation. Lauren already mentioned one: prune the states you perform value iteration over by ignoring any state whose belief value is too low. Language-using agents can expand on this by using language to communicate with humans and ask for clarification when the search space would be unreasonably large. POMDPs provide the perfect interface for asking for clarification, since they already maintain a belief over all possible states, which can simply be updated with new information. The part of this kind of pruning that I find truly interesting is the ease with which the language-using agent can treat human-provided information as flawed or partial. For example, an agent searching for an object in a house could ask a human for the object's location and receive replies of varying specificity and correctness, such as "By the front door", "Upstairs", or "In the living room or kitchen". For each of these responses, the agent could simply increase its belief in the states to which those statements apply, allowing even very general directions to be incorporated into the agent's actions.
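To illustrate that last point, the human's reply can be treated as just another noisy observation; a Bayes-style update might look something like this (the reliability number is an assumed parameter, not anything from the papers):

```python
import numpy as np

def incorporate_hint(belief, consistent, hint_reliability=0.9):
    """Update a belief vector with a possibly flawed human hint.

    `consistent[s]` is True when the hint ("by the front door", "upstairs", ...)
    applies to state s.  States the hint describes get likelihood
    `hint_reliability`; the rest get the remainder, so a wrong hint never
    drives any state's belief all the way to zero.
    """
    likelihood = np.where(consistent, hint_reliability, 1.0 - hint_reliability)
    posterior = likelihood * belief
    return posterior / posterior.sum()
```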
In the Hoey paper, the state may transition within about 10 ms because the human hand changes position quickly; computing the policy response is slow by comparison. It is useful for robots to ask for more information, but I was wondering how we should handle such quickly changing states when the robot is asking questions. What if it keeps asking about states that have already changed?
That's a really cool point. I imagine it's in a pretty good position to know when to ask, too, since it must know the states and actions it will enumerate over. It would be interesting to formalize this into a framework that takes the tradeoff between human time and knowledge and robot expense into account.
There are a lot of ways that actions could be improved based on the current state. For example, a less aware patient may require stronger prompting. While the POMDP model has the potential to respond to all these situations differently, it does not by itself provide an accurate method of doing so.
One way to solve this is to create another model for output, but getting it to use all of these factors efficiently may be difficult. We could combine a language model with a learning algorithm that learns from the favorableness of actual outcomes. In other words, the favorableness of outcomes under the current POMDP model would help drive an output language synthesizer, which in turn drives actions for the current POMDP model. It's possible that this sort of circular approach could magnify mistakes made in initial learning, so the original states, variables, and their probabilities should be set up independently first.
Because the task is limited and so well-scoped, we can use a model with a small word library, such as SHRDLU. However, SHRDLU's method of processing through memory is inappropriate for this task, and context should be acquired from the POMDP model instead. How this works exactly is unclear, since the POMDP operates mostly on probabilities. However, we can make a distinction between state and action, allowing for this difference to stand: even though the POMDP uses probabilities to determine possible states, SHRDLU could take one or two of the most likely states, along with any variables, and convert them into an action.
I think using a POMDP has two challenges. The first is the belief state. A POMDP agent cannot observe the current state; it makes observations based on its actions and the resulting states. In a POMDP, the belief state summarizes the agent's previous experience. In order to choose actions effectively, the agent needs some way to judge its uncertainty. To solve this problem, the agent is given a current belief state that is a sufficient statistic: no additional past data can increase the agent's expected reward. As long as the agent has not observed the final state, it always keeps some nonzero belief spread across several states.
Another challenge, which I think is the biggest, is finding an approximation to the optimal value function. As with MDPs, if we can get the optimal value function then we can get the optimal policy. In the MDP case value iteration is used to get the value function, so the authors of this article also use value iteration with exhaustive enumeration to get the POMDP's value function. However, this algorithm is not efficient; it still does more work than necessary. So the authors introduce a new algorithm, the witness algorithm, to solve this problem.
A POMDP agent cannot observe the current state; it makes an observation based on the action taken and the resulting state. As in Leslie's paper, a weakness of the methods presented is the state representation of the world: they require the states to be represented enumeratively rather than through compositional representations. Memory of previous actions and observations is necessary to reduce ambiguity about the state of the world, so the state estimator must keep updating the belief state, which summarizes its previous experience, based on the last action, the current observation, and the previous belief state. If the process is long, the time and memory complexity may be very high. Moreover, a POMDP agent must map the current belief state into an action, and another challenge is how to make that mapping accurate. The paper approaches the problem by using value iteration to compute the optimal value function, then uses it to directly determine the optimal policy mapping belief states into actions. Hoey's paper says the mapping process requires a form of temporal abstraction. Also, as in the example in that paper, handwashing observations can arrive very fast (a new hand location estimate can be made every 50-100 ms), so the POMDP state should not be updated that quickly, because of the time scale at which prompting actions themselves occur. When the policy should be consulted and when the belief state should be updated are also challenges for a POMDP; in Hoey's paper they adopt a simple heuristic: update the belief state either when the hand position change would cause a significant change in the belief state, or when the belief state has not changed for a long time (timeout).
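For reference, the belief-state update the state estimator performs can be written compactly. A minimal sketch, assuming tabular transition and observation models indexed by action:

```python
import numpy as np

def belief_update(belief, T, O, action, obs):
    """One step of the state estimator: b'(s') is proportional to
    O[a][s', o] * sum_s T[a][s, s'] * b(s), then normalized.

    T[a] is an |S| x |S| transition matrix (rows: s, columns: s'),
    O[a] is an |S| x |Obs| observation matrix (rows: s', columns: o).
    """
    predicted = belief @ T[action]                # sum over previous states
    unnormalized = predicted * O[action][:, obs]  # weight by observation likelihood
    return unnormalized / unnormalized.sum()
```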
The POMDP model has four parts: a finite set of actions, a stochastic transition model, a stochastic observation model, and a reward function. The system state is not known with certainty, so we need a policy that maps belief states into choices of actions; the POMDP must decide the next action based on its current belief. In Hoey's paper, the variables determining the state include the task, the user's attitude, and book-keeping variables. However, the state cannot be observed exactly. Thus the challenge for using POMDPs stated in the papers is to find an optimal policy mapping belief states into choices of actions.
What is more, the number of belief states in a POMDP is infinite, which makes the value functions expensive to compute. Kaelbling's paper puts forward a method for solving POMDPs exactly in some cases. Policy trees are evaluated over belief states, and the authors use the convexity of the optimal value function to find the best solutions: value functions are represented as sets of vectors, and vectors that are unimportant or dominated by other policy trees are discarded. As each action becomes less useful for gathering information from the environment, the witness algorithm needs more steps. The witness algorithm simplifies the computation to some extent, but it is still time-consuming for larger POMDPs; using an approximation method is a better solution there.
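The vector representation of the value function mentioned here is easy to sketch: each surviving policy tree contributes one alpha vector, and the value of a belief is the best dot product against them (finding those vectors is the hard part the witness algorithm addresses):

```python
import numpy as np

def value_of_belief(belief, alpha_vectors):
    """Piecewise-linear, convex value function: V(b) = max over alpha of (alpha . b).

    `alpha_vectors` is whatever set of undominated vectors the solver produced;
    pruning dominated vectors shrinks this set without changing the maximum.
    """
    return max(float(np.dot(alpha, belief)) for alpha in alpha_vectors)
```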
A POMDP agent cannot observe the current state; it observes based on its actions and the resulting states. In the section on acting optimally in MDPs, the paper describes a stationary policy π: S -> A, which maps each state to the action to be taken in that state; the policy π is responsible for generating actions. In Kaelbling, Littman and Cassandra's paper, π is instead the policy that maps belief states to actions. The difference between the POMDP policy π and the MDP policy is that the POMDP policy operates on belief states rather than on the states of the world.
In Kaelbling, Littman and Cassandra's paper, they mention that one advantage of the witness algorithm is that it is faster in practice over a comprehensive range of problem sizes. However, although the witness algorithm makes the computation easier, it is time-consuming on large POMDPs. Are these two views contradictory? Why does one experiment certify that witness is better and faster, while elsewhere the algorithm is said to be time-consuming?
I guess whether the experiments certify that witness is better and faster depends on the purpose and size of the experiments. For experiments whose POMDPs are based on fewer environment variables, it is faster and better than other algorithms, while in cases with 15 to 20 elements the size of the value function grows quickly. Is that a possible answer?
First off, I worry that I'm missing some nuance to this question, as POMDPs seem to be a model of task/action planning and ill-suited to language use. Such an agent could use language as sensor and/or actuator, but that's a detail tangential to the usage of POMDPs. With that said, I'm also assuming we're talking about modeling a 'real-world' agent, and as always when modeling the real world, the challenges are almost endless. There's the problem mentioned by several already, that normal algorithmic solutions (VI, etc.) aren't tractable. The optimization function is constantly changing, and not constant even for an individual, let alone a population. A simple example: say the robot is low on battery; its cost function should change from its current task so that its first priority is to plug itself in and recharge. There are value-estimation problems: how valuable is it to take a given action? There are state-feature problems: which exact state-action pairs do you update? If the world were truly, fully parameterized, you'd never see the same state again and never learn anything. Additionally, we'd like to learn things about objects in the world rather than states. We'd like behavior from a state where object A sits at (x,y,z) in the world to teach us about object A, not about that world-state as in the classical POMDP setting. In some cases these descriptions point towards solutions, even if the implementation isn't clear. In the case of object A, we'd like to learn features of object A that inform our current state, rather than a purely parameterized description of the state. Some problems seem more systemic to POMDPs: non-fixed costs and cost estimation in particular. I've generally thought of POMDPs as a fairly powerful framework, but listing these reasons makes me more skeptical about them. Something more goal-oriented (goals and subgoals, etc.) rather than transition-oriented seems much more appropriate. I sketch one rough description below*, but I look forward to hearing alternate opinions. :)
* Description: perhaps some kind of goal-action space (cf. state-action space), where for a given goal there is a set of actions that can be taken. The goal is specified independently of state, although there is perhaps a function satisfaction(goal, state): how much a world state satisfies a given goal. The effectiveness of an action is effectiveness(goal, action, state), with action selection being the argmax of that function over actions at the given (goal, state) pair. Maybe my key argument here is simply: in a given state, humans can have many different goals, so that should be set up as a separate parameter rather than encoded into a global value function. Note this parameterization doesn't address the issues of object generalization across states or tractability (indeed, it makes the latter worse!).
I had a similar feeling on my initial read of the paper: the POMDP seemed like more of an indirect approach to the language problem. Instead it seemed to just model the state of the world as it changed, using language as input and output.
However, the paper didn't seem to explain the exact prompts, which also makes me believe that the direct problem of creating and understanding language is dodged, or only fleetingly addressed. It seems to me that a POMDP can handle language inputs in a similar fashion to visual inputs, while output actions seem severely limited. Given the probabilistic states the POMDP holds, perhaps it's possible to use another algorithm for language output?
I think the modeling of the state space poses a really significant problem in real-world applications. I suppose a good way to tackle that would be to limit the representation to significant factors, but you still end up with the problem of seeing the same state with one object in a slightly different position. You could limit the state space even further by having each state represent a similar layout, i.e., a similar configuration of objects rather than exact objects in exact positions. (So seeing a cup at the center of the coffee table vs. a cup a few inches from the center would map to the same state, and then actions would perform the minor adjustments necessary to handle the slight differences.) The problem with this is that you can simply keep adding objects to a scene to create a new, more complicated state. Representing state in the real world seems like an incredible challenge, and perhaps POMDPs aren't a good solution.
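A crude version of that layout abstraction could just snap object positions to a coarse grid so that nearby configurations collapse into one state; a sketch, with an arbitrary cell size assumed:

```python
def abstract_state(object_positions, cell_size=0.25):
    """Map a dict of {object_name: (x, y)} positions to a coarse, hashable state.

    Rounding each coordinate to a grid cell means "cup at the center of the
    table" and "cup a few inches from the center" become the same state.
    """
    return frozenset(
        (name, round(x / cell_size), round(y / cell_size))
        for name, (x, y) in object_positions.items()
    )
```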
I'm interested in your idea of modeling sub-goals for a POMDP and factoring the POMDP problem in a way analogous to factoring the joint distribution of a graphical model through independence assumptions. It seems to me that if states are independent of each other, then you might be able to perform this kind of separation.
I'm with Kurt on the first bit – it's not clear to me that the POMDP is designed for language use. The core of the model seems to be turning beliefs about an environment into a plan of action, which doesn't fit neatly into my sense of the `language problem`. Many of these applications require language use, though, so perhaps I'm off base here. One challenge of the POMDP is that it must be given knowledge of the domain to get off the ground – in the Hoey paper it is stated that their model is "specified manually, using prior knowledge of the domain" (p. 503, Section 3). Based on their description of the model, it sounds like this domain knowledge consists of the planstep graph and the users' possible behavior (PS & BE – Section 3.1). If we imagine applying the POMDP to a less compact task than handwashing, say a chef-robot that prepares meals, then this knowledge is not nearly as compact. Furthermore, there might be more than one way for the agent to achieve subgoals toward task completion (e.g., we can imagine encountering novel environment configurations whose possible behaviors are not already encoded in the planstep graph). One way to overcome this challenge would be to equip the agent with a means of amending its planstep graph (and the associated BE set) via language commands or queries (i.e., asking for information regarding the equivalence of certain tasks). In other words, we can imagine a compositional approach to domain knowledge that augments the planstep graph (e.g., "does stirring the eggs with a fork have the same effect as stirring them in a mixer?", etc.).
There are two kinds of uncertainty in the robot's world. First, its observations are not reliable enough to confirm its world state, such as where the robot is and where the surrounding objects are. Second, its actions may not be reliable enough to infer the next state of the world: its intended actions may result in completely unacceptable outcomes, such as overshooting when it intends to move a certain distance or turn a certain number of degrees. The examples are given for a mobile robot.
If we apply this to a language-using agent that picks for itself what utterance to speak, the agent may be uncertain about what has been said by a user or what outcome to expect when a certain phrase is spoken or a certain action is executed. There can be a challenge in adapting the dialogue context to a POMDP in a given situation. Modeling a world state, and inferring a world state from a given phrase, can also be challenging. Furthermore, since there is an almost infinite number of possible language combinations in a single phrase, combined with the possible actions or next phrases to speak, the number of possible states also becomes almost infinite. We need to give an agent either a context or a restricted environment to bring the problem down to a tractable size.
We can also think of a language-using agent helping itself to decrease the uncertainty in its observations and action outcomes. In this case it doesn't pick its own utterances, but it can decide whether or not to ask a question. Think of a case where the robot only sees a particular corner of a room and is having a hard time determining which room it is in: more than one room is a plausible candidate, and several rooms may have similar probabilities of being the current location. The robot might ask a question like "Which room am I currently in?" or "Could you confirm which room I am in, room 134 or 121?".
The robot can also ask questions to confirm its actions and be rewarded for that. It can ask questions like "Did I perform this action correctly?" or "Did I finish pouring the drink inside the cup into the bowl?". In an application like this, we need to decide how to incorporate the answer into the given probability equations.
I like the idea of having the robot ask for confirmation about the state of the world. As people have said, one of the most expensive parts of the computation is having to allow for any number of possible worlds. Asking a clarifying question would increase the certainty of the belief state and therefore allow for a more focused and accurate calculation. However, would there then be a problem of having to trust the human helper's response? I think it would be interesting to see whether the answer reduces uncertainty about the state of the world by more than the uncertainty about understanding the response adds to the confusion.
I'm intrigued by your idea that the human helper's response might actually increase the confusion - maybe we could use some kind of entropy measurement to quantify the value of the information gained from the human's answer.
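Something like the following could serve as that entropy measurement; the decision rule at the end is purely hypothetical, just to show how it might trade off against the cost of asking:

```python
import numpy as np

def belief_entropy(belief):
    """Shannon entropy (in bits) of the belief distribution; high entropy means
    the agent is confused, so a clarifying question is more likely to pay off."""
    p = np.asarray(belief)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def should_ask(belief, ask_cost=0.5, bits_per_answer=1.0):
    # Hypothetical rule: ask when the information we could plausibly gain
    # (bounded by how informative one answer can be) exceeds the cost of asking.
    return min(belief_entropy(belief), bits_per_answer) > ask_cost
```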
I think Jun's idea about rewarding the robot for asking reasonable questions is really great. The robot can reduce its uncertainty by asking humans questions and get rewarded for doing so. And this makes me think of a question: why can't the human directly help the robot reduce its observation uncertainty?
I found an extended POMDP model called HOP-POMDP, developed by Rosenthal [1], which takes advantage of humans who are already in the environment to provide observations. It takes into account the availability and cost limitations of humans, and the robot can ask humans in order to obtain accurate observations. On top of the original POMDP model, HOP-POMDP adds two additional elements, λ and α, to the tuple, corresponding to the cost of asking each human and the availability of each human, respectively.
[1] Stephanie Rosenthal, Manuela Veloso. "Modeling Humans as Observation Providers using POMDPs." In Proc. International Symposium on Robot-Human Communication (Ro-Man 2011), pp. 53-58, July 2011.
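Based only on the description above (not on Rosenthal's actual formalization, which should be checked against the paper), the extended tuple might be sketched like this:

```python
from dataclasses import dataclass

@dataclass
class HOPPOMDP:
    """Illustrative container for the HOP-POMDP tuple as described above.

    The first six fields are the usual POMDP ingredients; `ask_cost` (lambda)
    and `availability` (alpha) are indexed per human in the environment.
    """
    states: list
    actions: list
    observations: list
    T: dict               # transition model P(s' | s, a)
    O: dict               # observation model P(o | s', a)
    R: dict               # reward model
    ask_cost: dict        # lambda_h: cost of asking human h
    availability: dict    # alpha_h: probability that human h is available
```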
POMDPs can be used to solve Markovian tasks. The idea, I assume, would be to break a task (in this case a language problem) down into the possible Markovian states, however expansive or non-stochastic (with some loops in the graph). This leads to a large number of elements/nodes/branches in the resulting graph. In the language problem, an observation can be something said by a user plus other modes of observation like motor spin counting, computer vision, or any other sensor the agent has, and an action can involve dialogue with the user. Speech observations will be noisy since speech-to-text technology is poor, and the parser can be terrible, so our belief about the state will be poor. But assuming we can break our task into a relatively doable finite-state Markovian process, we can make it tractable with the different methods mentioned in the paper. I guess that is what the Hoey paper demonstrates: how to cast a problem as a POMDP such that it is tractable. I agree with Kurt that we cannot solve the whole task of language with POMDPs, but we can break problems down into workable sizes.
The POMDP model is used to solve tasks with the Markov property, i.e., the transition probabilities and expected rewards depend only on the present state and are independent of the previous history. However, as we discussed in previous classes, the meaning of natural language can be highly context dependent. For instance, in the robot navigation example mentioned in L. Kaelbling's paper, we can have commands like "go to the northeast corner of the fourth floor," which are always unambiguous, or like "go to that corner," which might only be resolved by consulting the previous dialogue or action history. A possible solution might be to maintain a separate semantics model, keep it updated after receiving each language command and executing each action, and use it to resolve ambiguity when needed.
Another challenge that I can think of is that human language, as was shown in previous classes as well as in the G^3 paper, is highly flexible. If language is used to form the state space, it might result in an enormous, intractable space. Therefore, we might need another layer between language and the robot's internal states: for example, first use semantic parsing to convert language into a concise logical form, then use the resulting logical form to construct the state space.
I think one of the challenges you pointed out concerns the system's ability to adapt to users. Based on the needs and preferences of the user, the system should be able to update its interaction strategy. Adaptivity depends on input generality, since constraints posed by input specificity reduce the potential for exploration by the system. For 'go to that corner', the system can refer to the previous dialogue, but what happens if the previous dialogue or action history does not provide enough information for the system to act (e.g., if there is a threshold below which the robot should not act)? I think it could borrow from SHRDLU and ask questions to clarify. As noted in Hoey's paper, their system has only limited adaptivity, and they are investigating Bayesian reinforcement learning methods to make the system adaptive to users. Moreover, for the second challenge in your post, would it be possible to update the existing design instead of adding a new layer, given the time complexity? I think both of the papers have time complexity issues, so we should try to avoid making things slower.
One problem you could encounter in using a POMDP to drive a language-using agent is the size of the state space. If the agent has to interact with humans using natural language, the state space is amplified by ambiguity and synonyms. For example, in Hoey et al. the state space is divided up quite conveniently: among several other metrics, the algorithm is able to tell (fairly quickly) whether the hands are near the faucet, the towel, etc. Imagine if instead the user were describing what they were doing verbally. There are many ways different users could describe the same task, and although it might be possible to train a parser as we saw in G^3, the resulting state space would be significantly larger than before.
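One cheap way to keep the observation space from exploding would be to collapse synonymous descriptions into a small set of canonical observation symbols before they ever reach the POMDP. A toy sketch (the phrases and symbols are invented for illustration):

```python
# Invented example phrases mapped to canonical observation symbols.
CANONICAL_OBS = {
    "rinsing my hands": "hands_at_faucet",
    "washing up": "hands_at_faucet",
    "drying off": "hands_at_towel",
    "grabbing the towel": "hands_at_towel",
}

def observation_from_utterance(utterance, default="unknown"):
    """Map a verbal description to one of a fixed set of observation symbols,
    so the POMDP's observation space stays the same size no matter how many
    ways users phrase the same activity."""
    return CANONICAL_OBS.get(utterance.strip().lower(), default)
```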
Along a similar note, the POMDP requires the agent to use a reward function for the evaluation of policies. This means the agent has to be able to decipher the user's verbal intentions, which seems like a much more difficult task than simply observing the user's actions and inferring intention from them.
I think that a language-using agent must supplement language with other information in order for us to successfully apply the POMDP model. Examples of such additional knowledge could be physical and/or visual environment information, which would allow it to significantly narrow the state space it is using. In addition, perhaps the POMDP could be better used as an inference engine: it seems a POMDP would be more apt at handling inference under uncertainty than at dealing with a more classic grounded language problem. One advantage of using it during inference would be its information retrieval and reinforcement learning aspects, which would allow it to collect supplementary information or ask for help when in need.
Most people (myself included) have noted potential problems with the state space, but to be clear: a POMDP-driven robot could still use NLP to inform the state space without causing an explosion in its size. The state can be parameterized not by raw sensor readings but by intermediate values that are informed by sensors. Then NLP is just one sensor among many.
POMDPs for inference is an idea I don't think I've encountered before, so I thought I'd explore it. I'm thinking of different ways to model it, and I'll admit my first passes don't make too much sense. At first, states could be assignments and actions simply which things to associate. However, with the complexity of inference, the state space is going to explode very quickly, and this concept of actions is more or less meaningless; there are only transitions between states. I might actually see POMDPs used as a subset of inference in the domain of procedural instructions. That is to say, we use POMDPs as models of certain things (e.g. triage, games) to great effect. And while I think there are problems with using POMDPs directly for task-action selection, they could be used to model our world understanding, and then, based on that, we perform certain actions, do tasks, etc. (not according to a cost function as in the POMDP setting, but according to some goal-oriented task selection based on a natural world understanding modeled by the POMDP).
The POMDP model is used to make sequential decisions in a dynamic and uncertain environment. In the POMDP model, the effects of the actions an agent executes are uncertain, and the observations the agent makes are also uncertain, so the agent must maintain a probability distribution over possible states and observations.
For a language-using agent like SHRDLU, three things involve language: understanding language, making inferences based on language, and generating language. Since the solution to a POMDP model is the optimal action for each possible belief over world states, it cannot by itself be used to understand language, that is, to map word meanings to groundings. The models in SHRDLU, Tellex's system and Chen & Mooney's paper all use symbolic rules or probabilistic models to create the mappings between language and physical objects, and I don't think a POMDP has this ability to understand natural language. For an agent making decisions during dialogue, however, a POMDP can work very well. For example, in a real setting the command is "put the bottle on the tray" under the condition that the robot has one hand already holding a can. With belief states and observations that are both uncertain in the real world, the POMDP can compute a sequence of actions to complete the task. For interacting with people, the POMDP model can be used to generate responses or questions: it can treat the initial natural language input as world states and take responses and questions as actions. So for a command like "bring me a cup of coffee", the robot can go to the place with the highest probability of having coffee. However, the Markov property (the state and reward at time t+1 depend only on the state and action at time t) limits references to the previous dialogue: in a context-aware situation, the agent cannot look up earlier information in the dialogue.
Even while keeping the Markov property, if we incorporate context into the world states, we can still use a POMDP; keeping context then just means stepping through different states in a given world. However, a POMDP isn't appropriate if we explicitly want to keep a memory of context.
There are a couple of issues with using the POMDP model. The first is the computational complexity of solving POMDPs. For small state-action spaces this may be tractable, but for the unbounded situations encountered in robotics it could all too easily blow up. RL for robotics is already plagued by the curse of dimensionality; add in the complexity of stochastic environments and it only gets worse.
The second issue is that Markovian processes assume independence from previous states, but in a dialogue that is obviously a problem, because the context of the dialogue provides meaning for future discussion. Even a simple word like 'it' requires the agent to recall the previous sentence. One could try to extend the POMDP to Markov chains of a higher order, but that of course compounds issue 1.
I think the best solution here is to use a mix of stochastic and heuristic methods, so that the dimensionality of the space can be reduced and the stochastic methods are applied only where absolutely necessary.
Do you have any further suggestions for limiting the dimensionality of the space? I like your idea of extending the POMDP to handle Markov chains of a higher degree, but as you point out, that makes the state space explode. I wonder if using a completely memoryless model for the POMDP would have serious shortcomings for language, since language is riddled with memory-bound properties. As others have mentioned, we might imagine the agent prompting for clarity to isolate which state the POMDP ought to believe it is in. I wonder if we could pair this prompting (a la the heuristic you mentioned) with a memory-augmented POMDP (i.e., Markov chains of order 2-4), or perhaps some form of bank that tracks salient features of previous utterances (e.g., one could imagine tracking salient words via repetition, phonetic emphasis, or uniqueness, and pairing this with the standard Markovian assumption to prune the search space based on topic relevance).
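A minimal sketch of that memory-augmented idea, assuming we simply fold the last k utterances into the state itself (which preserves the Markov property at the cost of a larger state space):

```python
from collections import deque

class HistoryAugmentedState:
    """Bundle the world state with a short window of recent utterances so that
    references like 'it' can be resolved against recent context while the
    process remains Markov in this augmented state."""

    def __init__(self, world_state, k=2):
        self.world_state = world_state
        self.recent_utterances = deque(maxlen=k)  # only the last k are kept

    def advance(self, new_world_state, utterance):
        self.recent_utterances.append(utterance)
        self.world_state = new_world_state
```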
The lack of linguistic context is an interesting point, although I'm not entirely sure it's a problem, at least in the case where the language system is being used to generate observations. Generating observations happens completely independently of solving the POMDP; it doesn't require any information from the POMDP itself. Suppose a robot were trying to learn about its environment, which is a POMDP, and it was brought to a dark room and someone said "The light is off", which is interpreted by a language module and understood. The robot, after flailing around for some amount of time, hits the switch, and then the person says "Now it is on." The language module would have some built-in model of the entire dialogue (perhaps QUD?) and would be able to work out what "it" is. So making sense of things like pronouns doesn't seem to be affected by the assumptions of the POMDP model.
Consider a much smaller state space with sub-goals. My example here would be a scenario where the robot picks up a bottle based on a spoken command. The problem can be broken down into the sub-problems of understanding the NL command and picking up the bottle. Once the task of understanding the command is completed, you do not need to go back to that state unless some information has changed. The tasks of understanding the NL command and of picking up the bottle can then be broken down into new RL/SL (supervised learning) tasks. So the space can be controlled (broken up into smaller pieces), given that your scene/episode is limited in size. Obviously, if you consider longer scenarios the state space explodes, but maybe you can then search over parts or a limited depth of the tree (in this case, finite episodes of a fixed length) to limit the search space. As a simple example, if the robot is in the kitchen, it really does not need to think about goals of watering the plants in the garden.
The POMDP provides a framework for problems of actors trying to achieve goals in a world they can't perfectly observe. POMDPs are very difficult problems that require a great deal of computational power and time, or careful construction, in order to be tractable. POMDPs can be solved through reinforcement learning, which generally requires a tremendous amount of training to find a solution; otherwise programmers must provide hand-written models designed to be tractable, which can be solved directly, as Hoey et al. do with the SymbolicPerseus software. So using POMDPs is difficult in that it requires either a great deal of time and resources or very careful intervention by the designers for a particular domain.
POMDPs are useful for modeling actors trying to achieve goals, but it's not obvious what contributions they can make to machine understanding of language. In this class we are trying to solve the problem of grounding words in the world, and POMDPs offer a way of "understanding" the world in a fairly realistic manner. We might plug some existing language-interpreting system into a POMDP for acting, by using our language system to extract "observations" from spoken sentences describing the world. We might also use the level of belief in the system to ask for clarifications (additional observations) about the world when an actor appears to be confused. So while POMDPs aren't necessarily the key to unlocking the meaning of language, they could well be a useful way to act rationally based on linguistic input.
I don't see anything in particular that makes driving a POMDP agent more difficult using language, outside of the intractability of many reasonably sized POMDP problems. It does allow the planner to capture a lot of the ambiguity inherent in language and in the groundings of specific NPs -- the notion of partial observability nicely models uncertainty in perception. Perhaps the larger problem -- even more problematic than their unsolvability -- is that it's not clear where the POMDP would come from or how it would be constructed. Converting from a language plan to a POMDP would involve modeling the relevant states, actions, observations, and the observation function! You might have some success by predefining a (limited) domain for the POMDP, or by evading solving it directly: either through reinforcement learning or by approximating the solution as in Hoey et al. But it seems like the space of states for most realistic domains would be far too large to perform value iteration on, and would probably take too long for reinforcement learning as well.
I think you want your POMDP to work in a restricted domain; as you said, the states for an unrestricted domain would be too numerous. In a (very) restricted domain, I think, a POMDP would deal with a reasonable number of states. For example, developing a POMDP model that works in the simulated world in which Chen and Mooney tested their system would be feasible.
Constructing the reward function is tricky. In order for a POMDP model to learn how to act, it should map language inputs to rewards, but language inputs don't correspond to predefined actions. For example, given an instruction "go to X," the model should realize that reaching position X would result in some positive reward. On the other hand, given another instruction, "stay away from X," it should understand that going to position X would result in some negative reward. Mapping words to real values is essential to building such a POMDP model.
In an unrestricted domain, this kind of mapping would be very difficult. In a restricted domain where the model deals with a limited number of objects and a limited vocabulary, it might be able to learn how to act accordingly. Mapping words to objects in the real world seems doable. Given simple sentences, it may be able to learn how to parse them into something that later gets mapped to real values, or it could learn to map sentences directly to real values.
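A toy sketch of that word-to-reward mapping, assuming (hypothetically) that the instruction has already been parsed into a verb phrase and a landmark position:

```python
import numpy as np

def reward_from_instruction(verb_phrase, position, target, radius=1.0):
    """Map a parsed instruction to a scalar reward for the agent's current position.

    "go to"          -> positive reward for being within `radius` of the landmark
    "stay away from" -> negative reward for being within `radius` of the landmark
    """
    near = np.linalg.norm(np.asarray(position) - np.asarray(target)) <= radius
    if verb_phrase == "go to":
        return 1.0 if near else 0.0
    if verb_phrase == "stay away from":
        return -1.0 if near else 0.0
    return 0.0  # unrecognized instruction: no reward signal
```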
One question: on page 102 of L.P. Kaelbling et al., is the current state, s_t, at the bottom of the page the t-th-to-last state? If not, I am not sure whether I understand MDPs.