This week we will read a pair of papers about POMDPs, Partially Observable Markov Decision Processes. This framework has been used for robot planning and perception, as well as spoken dialogue systems.
By 5pm on Sunday, post about 25 words answering the following question:
- What are the challenges in using a POMDP model to drive a language-using agent? How could those challenges be overcome?
By 5pm on Monday, post a reply to someone else's blog post. Suggest an alternate solution, ask a clarifying question, or point out something they might find useful.
One difficulty in using a POMDP is that the agent does not know which state it is in. Instead, it maintains a belief state, a probability distribution over the possible states. This makes reasoning about the optimal action more complicated: instead of just computing the best action for the current state, the agent has to compute values over all the states it might be in. A possible solution is to use plan graphs, which do not require any explicit representation of the belief state. As demonstrated in the paper, a plan graph can often be simplified by eliminating nodes that are never reached from the initial state. However, the size of the optimal plan graph increases with uncertainty about the world, so much more memory is needed to store the graph. One way to reduce the cost of computation is to impose a probability threshold on the possible states: instead of evaluating the policy tree for all states, disregard the states with probability lower than the threshold and only consider the more likely ones.
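A rough sketch of that thresholding idea (the cutoff value is just an assumed parameter) might look like this:

```python
import numpy as np

def prune_belief(belief, threshold=0.01):
    """Zero out states whose belief falls below `threshold`, then renormalize.

    `belief` is a 1-D array holding a probability for each world state.
    The planner would then only evaluate policy trees over the surviving states.
    """
    pruned = np.where(belief >= threshold, belief, 0.0)
    total = pruned.sum()
    if total == 0.0:  # everything fell below the cutoff; keep the original belief
        return belief
    return pruned / total
```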
Another challenge is that not all POMDPs have optimal infinite-horizon value functions that can be represented finitely, and as a result they can only be solved approximately. The tiger example from the paper does, so value iteration solves that problem. For POMDPs that are not finitely transient, a solution could be to approximate the model with a finitely transient policy, again disregarding the unlikely nodes, and then use value iteration.
I think the largest challenge raised by using a POMDP model is the computational requirement of doing value iteration over every state that the agent believes is possible. Luckily, language-using agents have a couple of interesting ways of pruning down this computation. Lauren already mentioned one: prune the states you perform value iteration over by ignoring any state whose belief value is too low. Language-using agents can expand on this by using language to communicate with humans and ask for clarification when the search space would be unreasonably large. POMDPs provide the perfect interface for asking for clarification, since they already maintain a belief over all possible states, which can simply be updated with new information. The part of this kind of pruning that I find truly interesting is the ease with which the language-using agent can treat human-provided information as flawed or partial. For example, an agent searching for an object in a house could ask a human for the object's location and receive replies of varying specificity and correctness, such as "By the front door", "Upstairs", or "In the living room or kitchen". For each of these responses, the agent could simply increase its belief in the states to which those statements apply, allowing even very general directions to be incorporated into the agent's actions.
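To illustrate that last point, the human's reply can be treated as just another noisy observation; a Bayes-style update might look something like this (the reliability number is an assumed parameter, not anything from the papers):

```python
import numpy as np

def incorporate_hint(belief, consistent, hint_reliability=0.9):
    """Update a belief vector with a possibly flawed human hint.

    `consistent[s]` is True when the hint ("by the front door", "upstairs", ...)
    applies to state s.  States the hint describes get likelihood
    `hint_reliability`; the rest get the remainder, so a wrong hint never
    drives any state's belief all the way to zero.
    """
    likelihood = np.where(consistent, hint_reliability, 1.0 - hint_reliability)
    posterior = likelihood * belief
    return posterior / posterior.sum()
```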
In the Hoey paper, the state may transition within about 10 ms because the human hand changes position quickly; computing the policy response is slow by comparison. It is useful for robots to ask for more information, but I was wondering how we should handle such quickly changing states when the robot is asking questions. What if it keeps asking about states that have already changed?
That's a really cool point. I imagine it's in a pretty good position to know when to ask, too, since it must know the states and actions it will enumerate over. It would be interesting to formalize this into a framework that takes the tradeoff between human time and knowledge and robot expense into account.
There are a lot of ways that actions could be improved based on the current state. For example, a less aware patient may require stronger prompting. While the POMDP model has the potential to respond to all these situations differently, it does not by itself provide an accurate method of doing so.
One way to solve this is to create another model for output, but getting it to use all of these factors efficiently may be difficult. We could combine a language model with a learning algorithm that learns from the favorableness of actual outcomes. In other words, the favorableness of outcomes under the current POMDP model would help drive an output language synthesizer, which in turn drives actions for the current POMDP model. It's possible that this sort of circular approach could magnify mistakes made in initial learning, so the original states, variables, and their probabilities should be set up independently first.
Because the task is limited and so well-scoped, we can use a model with a small word library, such as SHRDLU. However, SHRDLU's method of processing through memory is inappropriate for this task, and context should be acquired from the POMDP model instead. How this works exactly is unclear, since the POMDP operates mostly on probabilities. However, we can make a distinction between state and action, allowing for this difference to stand: even though the POMDP uses probabilities to determine possible states, SHRDLU could take one or two of the most likely states, along with any variables, and convert them into an action.
I think using a POMDP has two challenges. The first is the belief state. A POMDP agent cannot observe the current state; it makes observations based on its actions and the resulting states. In a POMDP, the belief state summarizes the agent's previous experience. In order to choose actions effectively, the agent needs some way to judge its uncertainty. To solve this problem, the agent is given a current belief state that is a sufficient statistic: no additional past data can increase the agent's expected reward. As long as the agent has not observed the final state, it always keeps some nonzero belief spread across several states.
Another challenge, which I think is the biggest, is finding an approximation to the optimal value function. As with MDPs, if we can get the optimal value function then we can get the optimal policy. In the MDP case value iteration is used to get the value function, so the authors of this article also use value iteration with exhaustive enumeration to get the POMDP's value function. However, this algorithm is not efficient; it still does more work than necessary. So the authors introduce a new algorithm, the witness algorithm, to solve this problem.
A POMDP agent cannot observe the current state; it makes an observation based on the action taken and the resulting state. As in Leslie's paper, a weakness of the methods presented is the state representation of the world: they require the states to be represented enumeratively rather than through compositional representations. Memory of previous actions and observations is necessary to reduce ambiguity about the state of the world, so the state estimator must keep updating the belief state, which summarizes its previous experience, based on the last action, the current observation, and the previous belief state. If the process is long, the time and memory complexity may be very high. Moreover, a POMDP agent must map the current belief state into an action, and another challenge is how to make that mapping accurate. The paper approaches the problem by using value iteration to compute the optimal value function, then uses it to directly determine the optimal policy mapping belief states into actions. Hoey's paper says the mapping process requires a form of temporal abstraction. Also, as in the example in that paper, handwashing observations can arrive very fast (a new hand location estimate can be made every 50-100 ms), so the POMDP state should not be updated that quickly, because of the time scale at which prompting actions themselves occur. When the policy should be consulted and when the belief state should be updated are also challenges for a POMDP; in Hoey's paper they adopt a simple heuristic: update the belief state either when the hand position change would cause a significant change in the belief state, or when the belief state has not changed for a long time (timeout).
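For reference, the belief-state update the state estimator performs can be written compactly. A minimal sketch, assuming tabular transition and observation models indexed by action:

```python
import numpy as np

def belief_update(belief, T, O, action, obs):
    """One step of the state estimator: b'(s') is proportional to
    O[a][s', o] * sum_s T[a][s, s'] * b(s), then normalized.

    T[a] is an |S| x |S| transition matrix (rows: s, columns: s'),
    O[a] is an |S| x |Obs| observation matrix (rows: s', columns: o).
    """
    predicted = belief @ T[action]                # sum over previous states
    unnormalized = predicted * O[action][:, obs]  # weight by observation likelihood
    return unnormalized / unnormalized.sum()
```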
The POMDP model has four parts: a finite set of actions, a stochastic transition model, a stochastic observation model, and a reward function. The system state is not known with certainty, so we need a policy that maps belief states into choices of actions; the POMDP must decide the next action based on its current belief. In Hoey's paper, the variables determining the state include the task, the user's attitude, and book-keeping variables. However, the state cannot be observed exactly. Thus the challenge for using POMDPs stated in the papers is to find an optimal policy mapping belief states into choices of actions.
What is more, the number of belief states in a POMDP is infinite, which makes the value functions expensive to compute. Kaelbling's paper puts forward a method for solving POMDPs exactly in some cases. Policy trees are evaluated over belief states, and the authors use the convexity of the optimal value function to find the best solutions: value functions are represented as sets of vectors, and vectors that are unimportant or dominated by other policy trees are discarded. As each action becomes less useful for gathering information from the environment, the witness algorithm needs more steps. The witness algorithm simplifies the computation to some extent, but it is still time-consuming for larger POMDPs; using an approximation method is a better solution there.
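The vector representation of the value function mentioned here is easy to sketch: each surviving policy tree contributes one alpha vector, and the value of a belief is the best dot product against them (finding those vectors is the hard part the witness algorithm addresses):

```python
import numpy as np

def value_of_belief(belief, alpha_vectors):
    """Piecewise-linear, convex value function: V(b) = max over alpha of (alpha . b).

    `alpha_vectors` is whatever set of undominated vectors the solver produced;
    pruning dominated vectors shrinks this set without changing the maximum.
    """
    return max(float(np.dot(alpha, belief)) for alpha in alpha_vectors)
```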
A POMDP agent cannot observe the current state; it observes based on its actions and the resulting states. In the section on acting optimally in MDPs, the paper describes a stationary policy π: S -> A, which maps each state to the action to be taken in that state; the policy π is responsible for generating actions. In Kaelbling, Littman and Cassandra's paper, π is instead the policy that maps belief states to actions. The difference between the POMDP policy π and the MDP policy is that the POMDP policy operates on belief states rather than on the states of the world.
In Kaelbling, Littman and Cassandra's paper, they mention that one advantage of the witness algorithm is that it is faster in practice over a comprehensive range of problem sizes. However, although the witness algorithm makes the computation easier, it is time-consuming on large POMDPs. Are these two views contradictory? Why does one experiment certify that witness is better and faster, while elsewhere the algorithm is said to be time-consuming?
I guess whether the experiments certify that witness is better and faster depends on the purpose and size of the experiments. For experiments whose POMDPs are based on fewer environment variables, it is faster and better than other algorithms, while in cases with 15 to 20 elements the size of the value function grows quickly. Is that a possible answer?
First off, I worry that I'm missing some nuance to this question, as POMDPs seem to be a model of task/action planning and ill-suited to language use. Such an agent could use language as sensor and/or actuator, but that's a detail tangential to the usage of POMDPs. With that said, I'm also assuming we're talking about modeling a 'real-world' agent, and as always when modeling the real world, the challenges are almost endless. There's the problem mentioned by several already, that normal algorithmic solutions (VI, etc.) aren't tractable. The optimization function is constantly changing, and not constant even for an individual, let alone a population. A simple example: say the robot is low on battery; its cost function should change from its current task so that its first priority is to plug itself in and recharge. There are value-estimation problems: how valuable is it to take a given action? There are state-feature problems: which exact state-action pairs do you update? If the world were truly, fully parameterized, you'd never see the same state again and never learn anything. Additionally, we'd like to learn things about objects in the world rather than states. We'd like behavior from a state where object A sits at (x,y,z) in the world to teach us about object A, not about that world-state as in the classical POMDP setting. In some cases these descriptions point towards solutions, even if the implementation isn't clear. In the case of object A, we'd like to learn features of object A that inform our current state, rather than a purely parameterized description of the state. Some problems seem more systemic to POMDPs: non-fixed costs and cost estimation in particular. I've generally thought of POMDPs as a fairly powerful framework, but listing these reasons makes me more skeptical about them. Something more goal-oriented (goals and subgoals, etc.) rather than transition-oriented seems much more appropriate. I sketch one rough description below*, but I look forward to hearing alternate opinions. :)
* Description: perhaps some kind of goal-action space (cf. state-action space), where for a given goal there is a set of actions that can be taken. The goal is specified independently of state, although there is perhaps a function satisfaction(goal, state): how much a world state satisfies a given goal. The effectiveness of an action is effectiveness(goal, action, state), with action selection being the argmax of that function over actions at the given (goal, state) pair. Maybe my key argument here is simply: in a given state, humans can have many different goals, so that should be set up as a separate parameter rather than encoded into a global value function. Note this parameterization doesn't address the issues of object generalization across states or tractability (indeed, it makes the latter worse!).
I had a similar feeling on my initial read of the paper: the POMDP seemed like more of an indirect approach to the language problem. Instead it seemed to just model the state of the world as it changed, using language as input and output.
However, the paper didn't seem to explain the exact prompts, which also makes me believe that the direct problem of creating and understanding language is dodged, or only fleetingly addressed. It seems to me that a POMDP can handle language inputs in a similar fashion to visual inputs, while output actions seem severely limited. Given the probabilistic states the POMDP holds, perhaps it's possible to use another algorithm for language output?
I think the modeling of the state space poses a really significant problem in real-world applications. I suppose a good way to tackle that would be to limit the representation to significant factors, but you still end up with the problem of seeing the same state with one object in a slightly different position. You could limit the state space even further by having each state represent a similar layout, i.e., a similar configuration of objects rather than exact objects in exact positions. (So seeing a cup at the center of the coffee table vs. a cup a few inches from the center would map to the same state, and then actions would perform the minor adjustments necessary to handle the slight differences.) The problem with this is that you can simply keep adding objects to a scene to create a new, more complicated state. Representing state in the real world seems like an incredible challenge, and perhaps POMDPs aren't a good solution.
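A crude version of that layout abstraction could just snap object positions to a coarse grid so that nearby configurations collapse into one state; a sketch, with an arbitrary cell size assumed:

```python
def abstract_state(object_positions, cell_size=0.25):
    """Map a dict of {object_name: (x, y)} positions to a coarse, hashable state.

    Rounding each coordinate to a grid cell means "cup at the center of the
    table" and "cup a few inches from the center" become the same state.
    """
    return frozenset(
        (name, round(x / cell_size), round(y / cell_size))
        for name, (x, y) in object_positions.items()
    )
```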
I'm interested in your idea of modeling sub-goals for a POMDP and factoring the POMDP problem in a way analogous to factoring the joint distribution of a graphical model through independence assumptions. It seems to me that if states are independent of each other, then you might be able to perform this kind of separation.
I'm with Kurt on the first bit – it's not clear to me that the POMDP is designed for language use. The core of the model seems to be turning beliefs about an environment into a plan of action, which doesn't fit neatly into my sense of the `language problem`. Many of these applications require language use, though, so perhaps I'm off base here. One challenge of the POMDP is that it must be given knowledge of the domain to get off the ground – in the Hoey paper it is stated that their model is "specified manually, using prior knowledge of the domain" (p. 503, Section 3). Based on their description of the model, it sounds like this domain knowledge consists of the planstep graph and the users' possible behavior (PS & BE – Section 3.1). If we imagine applying the POMDP to a less compact task than handwashing, say a chef-robot that prepares meals, then this knowledge is not nearly as compact. Furthermore, there might be more than one way for the agent to achieve subgoals toward task completion (e.g., we can imagine encountering novel environment configurations whose possible behaviors are not already encoded in the planstep graph). One way to overcome this challenge would be to equip the agent with a means of amending its planstep graph (and the associated BE set) via language commands or queries (i.e., asking for information regarding the equivalence of certain tasks). In other words, we can imagine a compositional approach to domain knowledge that augments the planstep graph (e.g., "does stirring the eggs with a fork have the same effect as stirring them in a mixer?", etc.).
There are two kinds of uncertainty in the robot's world. First, its observations are not reliable enough to confirm its world state, such as where the robot is and where the surrounding objects are. Second, its actions may not be reliable enough to infer the next state of the world: its intended actions may result in completely unacceptable outcomes, such as overshooting when it intends to move a certain distance or turn a certain number of degrees. The examples are given for a mobile robot.
If we apply this to a language-using agent that picks for itself what utterance to speak, the agent may be uncertain about what has been said by a user or what outcome to expect when a certain phrase is spoken or a certain action is executed. There can be a challenge in adapting the dialogue context to a POMDP in a given situation. Modeling a world state, and inferring a world state from a given phrase, can also be challenging. Furthermore, since there is an almost infinite number of possible language combinations in a single phrase, combined with the possible actions or next phrases to speak, the number of possible states also becomes almost infinite. We need to give an agent either a context or a restricted environment to bring the problem down to a tractable size.
We can also think of a language-using agent helping itself to decrease the uncertainty in its observations and action outcomes. In this case it doesn't pick its own utterances, but it can decide whether or not to ask a question. Think of a case where the robot only sees a particular corner of a room and is having a hard time determining which room it is in: more than one room is a plausible candidate, and several rooms may have similar probabilities of being the current location. The robot might ask a question like "Which room am I currently in?" or "Could you confirm which room I am in, room 134 or 121?".
The robot can also ask questions to confirm its actions and be rewarded for that. It can ask questions like "Did I perform this action correctly?" or "Did I finish pouring the drink inside the cup into the bowl?". In an application like this, we need to decide how to incorporate the answer into the given probability equations.
I like the idea of having the robot ask for confirmation about the state of the world. As people have said, one of the most expensive parts of the computation is having to allow for any number of possible worlds. Asking a clarifying question would increase the certainty of the belief state and therefore allow for a more focused and accurate calculation. However, would there then be a problem of having to trust the human helper's response? I think it would be interesting to see whether the answer reduces uncertainty about the state of the world by more than the uncertainty about understanding the response adds to the confusion.
I'm intrigued by your idea that the human helper's response might actually increase the confusion - maybe we could use some kind of entropy measurement to quantify the value of the information gained from the human's answer.
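Something like the following could serve as that entropy measurement; the decision rule at the end is purely hypothetical, just to show how it might trade off against the cost of asking:

```python
import numpy as np

def belief_entropy(belief):
    """Shannon entropy (in bits) of the belief distribution; high entropy means
    the agent is confused, so a clarifying question is more likely to pay off."""
    p = np.asarray(belief)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def should_ask(belief, ask_cost=0.5, bits_per_answer=1.0):
    # Hypothetical rule: ask when the information we could plausibly gain
    # (bounded by how informative one answer can be) exceeds the cost of asking.
    return min(belief_entropy(belief), bits_per_answer) > ask_cost
```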
I think Jun's idea about rewarding the robot for asking reasonable questions is really great. The robot can reduce its uncertainty by asking humans questions and get rewarded for doing so. And this makes me think of a question: why can't the human directly help the robot reduce its observation uncertainty?
I found an extended POMDP model called HOP-POMDP, developed by Rosenthal [1], which takes advantage of humans who are already in the environment to provide observations. It takes into account the availability and cost limitations of humans, and the robot can ask humans in order to obtain accurate observations. On top of the original POMDP model, HOP-POMDP adds two additional elements, λ and α, to the tuple, corresponding to the cost of asking each human and the availability of each human, respectively.
[1] Stephanie Rosenthal, Manuela Veloso. "Modeling Humans as Observation Providers using POMDPs." In Proc. International Symposium on Robot-Human Communication (Ro-Man 2011), pp. 53-58, July 2011.
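Based only on the description above (not on Rosenthal's actual formalization, which should be checked against the paper), the extended tuple might be sketched like this:

```python
from dataclasses import dataclass

@dataclass
class HOPPOMDP:
    """Illustrative container for the HOP-POMDP tuple as described above.

    The first six fields are the usual POMDP ingredients; `ask_cost` (lambda)
    and `availability` (alpha) are indexed per human in the environment.
    """
    states: list
    actions: list
    observations: list
    T: dict               # transition model P(s' | s, a)
    O: dict               # observation model P(o | s', a)
    R: dict               # reward model
    ask_cost: dict        # lambda_h: cost of asking human h
    availability: dict    # alpha_h: probability that human h is available
```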
POMDPs can be used to solve Markovian tasks. The idea, I assume, would be to break a task (in this case a language problem) down into the possible Markovian states, however expansive or non-stochastic (with some loops in the graph). This leads to a large number of elements/nodes/branches in the resulting graph. In the language problem, an observation can be something said by a user plus other modes of observation like motor spin counting, computer vision, or any other sensor the agent has, and an action can involve dialogue with the user. Speech observations will be noisy since speech-to-text technology is poor, and the parser can be terrible, so our belief about the state will be poor. But assuming we can break our task into a relatively doable finite-state Markovian process, we can make it tractable with the different methods mentioned in the paper. I guess that is what the Hoey paper demonstrates: how to cast a problem as a POMDP such that it is tractable. I agree with Kurt that we cannot solve the whole task of language with POMDPs, but we can break problems down into workable sizes.
The POMDP model is used to solve tasks with the Markov property, i.e., the transition probabilities and expected rewards depend only on the present state and are independent of the previous history. However, as we discussed in previous classes, the meaning of natural language can be highly context dependent. For instance, in the robot navigation example mentioned in L. Kaelbling's paper, we can have commands like "go to the northeast corner of the fourth floor," which are always unambiguous, or like "go to that corner," which might only be resolved by consulting the previous dialogue or action history. A possible solution might be to maintain a separate semantics model, keep it updated after receiving each language command and executing each action, and use it to resolve ambiguity when needed.
Another challenge that I can think of is that human language, as was shown in previous classes as well as in the G^3 paper, is highly flexible. If language is used to form the state space, it might result in an enormous, intractable space. Therefore, we might need another layer between language and the robot's internal states: for example, first use semantic parsing to convert language into a concise logical form, then use the resulting logical form to construct the state space.
I think one of the challenges you pointed out concerns the system's ability to adapt to users. Based on the needs and preferences of the user, the system should be able to update its interaction strategy. Adaptivity depends on input generality, since constraints posed by input specificity reduce the potential for exploration by the system. For 'go to that corner', the system can refer to the previous dialogue, but what happens if the previous dialogue or action history does not provide enough information for the system to act (e.g., if there is a threshold below which the robot should not act)? I think it could borrow from SHRDLU and ask questions to clarify. As noted in Hoey's paper, their system has only limited adaptivity, and they are investigating Bayesian reinforcement learning methods to make the system adaptive to users. Moreover, for the second challenge in your post, would it be possible to update the existing design instead of adding a new layer, given the time complexity? I think both of the papers have time complexity issues, so we should try to avoid making things slower.
One problem you could encounter in using a POMDP to drive a language-using agent is the size of the state space. If the agent has to interact with humans using natural language, the state space is amplified by ambiguity and synonyms. For example, in Hoey et al. the state space is divided up quite conveniently: among several other metrics, the algorithm is able to tell (fairly quickly) whether the hands are near the faucet, the towel, etc. Imagine if instead the user were describing what they were doing verbally. There are many ways different users could describe the same task, and although it might be possible to train a parser as we saw in G^3, the resulting state space would be significantly larger than before.
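One cheap way to keep the observation space from exploding would be to collapse synonymous descriptions into a small set of canonical observation symbols before they ever reach the POMDP. A toy sketch (the phrases and symbols are invented for illustration):

```python
# Invented example phrases mapped to canonical observation symbols.
CANONICAL_OBS = {
    "rinsing my hands": "hands_at_faucet",
    "washing up": "hands_at_faucet",
    "drying off": "hands_at_towel",
    "grabbing the towel": "hands_at_towel",
}

def observation_from_utterance(utterance, default="unknown"):
    """Map a verbal description to one of a fixed set of observation symbols,
    so the POMDP's observation space stays the same size no matter how many
    ways users phrase the same activity."""
    return CANONICAL_OBS.get(utterance.strip().lower(), default)
```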
Along a similar note, the POMDP requires the agent to use a reward function for the evaluation of policies. This means the agent has to be able to decipher the user's verbal intentions, which seems like a much more difficult task than simply observing the user's actions and inferring intention from them.
I think that a language-using agent must supplement language with other information in order for us to successfully apply the POMDP model. Examples of such additional knowledge could be physical and/or visual environment information, which would allow it to significantly narrow the state space it is using. In addition, perhaps the POMDP could be better used as an inference engine: it seems a POMDP would be more apt at handling inference under uncertainty than at dealing with a more classic grounded language problem. One advantage of using it during inference would be its information retrieval and reinforcement learning aspects, which would allow it to collect supplementary information or ask for help when in need.
Most people (myself included) have noted potential problems with the state space, but to be clear: a POMDP-driven robot could still use NLP to inform the state space without causing an explosion in its size. The state can be parameterized not by raw sensor readings but by intermediate values that are informed by sensors. Then NLP is just one sensor among many.
POMDPs for inference is an idea I don't think I've encountered before, so I thought I'd explore it. I'm thinking of different ways to model it, and I'll admit my first passes don't make too much sense. At first, states could be assignments and actions simply which things to associate. However, with the complexity of inference, the state space is going to explode very quickly, and this concept of actions is more or less meaningless; there are only transitions between states. I might actually see POMDPs used as a subset of inference in the domain of procedural instructions. That is to say, we use POMDPs as models of certain things (e.g. triage, games) to great effect. And while I think there are problems with using POMDPs directly for task-action selection, they could be used to model our world understanding, and then, based on that, we perform certain actions, do tasks, etc. (not according to a cost function as in the POMDP setting, but according to some goal-oriented task selection based on a natural world understanding modeled by the POMDP).
The POMDP model is used to make sequential decisions in a dynamic and uncertain environment. In the POMDP model, the effects of the actions an agent executes are uncertain, and the observations the agent makes are also uncertain, so the agent must maintain a probability distribution over possible states and observations.
For a language-using agent like SHRDLU, three things involve language: understanding language, making inferences based on language, and generating language. Since the solution to a POMDP model is the optimal action for each possible belief over world states, it cannot by itself be used to understand language, that is, to map word meanings to groundings. The models in SHRDLU, Tellex's system and Chen & Mooney's paper all use symbolic rules or probabilistic models to create the mappings between language and physical objects, and I don't think a POMDP has this ability to understand natural language. For an agent making decisions during dialogue, however, a POMDP can work very well. For example, in a real setting the command is "put the bottle on the tray" under the condition that the robot has one hand already holding a can. With belief states and observations that are both uncertain in the real world, the POMDP can compute a sequence of actions to complete the task. For interacting with people, the POMDP model can be used to generate responses or questions: it can treat the initial natural language input as world states and take responses and questions as actions. So for a command like "bring me a cup of coffee", the robot can go to the place with the highest probability of having coffee. However, the Markov property (the state and reward at time t+1 depend only on the state and action at time t) limits references to the previous dialogue: in a context-aware situation, the agent cannot look up earlier information in the dialogue.
Even while keeping the Markov property, if we incorporate context into the world states, we can still use a POMDP; keeping context then just means stepping through different states in a given world. However, a POMDP isn't appropriate if we explicitly want to keep a memory of context.
There are a couple of issues with using the POMDP model. The first is the computational complexity of solving POMDPs. For small state-action spaces this may be tractable, but for the unbounded situations encountered in robotics it could all too easily blow up. RL for robotics is already plagued by the curse of dimensionality; add in the complexity of stochastic environments and it only gets worse.
The second issue is that Markovian processes assume independence from previous states, but in a dialogue that is obviously a problem, because the context of the dialogue provides meaning for future discussion. Even a simple word like 'it' requires the agent to recall the previous sentence. One could try to extend the POMDP to Markov chains of a higher order, but that of course compounds issue 1.
I think the best solution here is to use a mix of stochastic and heuristic methods, so that the dimensionality of the space can be reduced and the stochastic methods are applied only where absolutely necessary.
Do you have any further suggestions for limiting the dimensionality of the space? I like your idea of extending the POMDP to handle Markov chains of a higher degree, but as you point out, that makes the state space explode. I wonder if using a completely memoryless model for the POMDP would have serious shortcomings for language, since language is riddled with memory-bound properties. As others have mentioned, we might imagine the agent prompting for clarity to isolate which state the POMDP ought to believe it is in. I wonder if we could pair this prompting (a la the heuristic you mentioned) with a memory-augmented POMDP (i.e., Markov chains of order 2-4), or perhaps some form of bank that tracks salient features of previous utterances (e.g., one could imagine tracking salient words via repetition, phonetic emphasis, or uniqueness, and pairing this with the standard Markovian assumption to prune the search space based on topic relevance).
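A minimal sketch of that memory-augmented idea, assuming we simply fold the last k utterances into the state itself (which preserves the Markov property at the cost of a larger state space):

```python
from collections import deque

class HistoryAugmentedState:
    """Bundle the world state with a short window of recent utterances so that
    references like 'it' can be resolved against recent context while the
    process remains Markov in this augmented state."""

    def __init__(self, world_state, k=2):
        self.world_state = world_state
        self.recent_utterances = deque(maxlen=k)  # only the last k are kept

    def advance(self, new_world_state, utterance):
        self.recent_utterances.append(utterance)
        self.world_state = new_world_state
```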
The lack of linguistic context is an interesting point, although I'm not entirely sure it's a problem, at least in the case where the language system is being used to generate observations. Generating observations happens completely independently of solving the POMDP; it doesn't require any information from the POMDP itself. Suppose a robot were trying to learn about its environment, which is a POMDP, and it was brought to a dark room and someone said "The light is off", which is interpreted by a language module and understood. The robot, after flailing around for some amount of time, hits the switch, and then the person says "Now it is on." The language module would have some built-in model of the entire dialogue (perhaps QUD?) and would be able to work out what "it" is. So making sense of things like pronouns doesn't seem to be affected by the assumptions of the POMDP model.
Consider a much smaller state space with sub-goals. My example here would be a scenario where the robot picks up a bottle based on a spoken command. The problem can be broken down into the sub-problems of understanding the NL command and picking up the bottle. Once the task of understanding the command is completed, you do not need to go back to that state unless some information has changed. The tasks of understanding the NL command and of picking up the bottle can then be broken down into new RL/SL (supervised learning) tasks. So the space can be controlled (broken up into smaller pieces), given that your scene/episode is limited in size. Obviously, if you consider longer scenarios the state space explodes, but maybe you can then search over parts or a limited depth of the tree (in this case, finite episodes of a fixed length) to limit the search space. As a simple example, if the robot is in the kitchen, it really does not need to think about goals of watering the plants in the garden.
The POMDP provides a framework for problems of actors trying to achieve goals in a world they can't perfectly observe. POMDPs are very difficult problems that require a great deal of computational power and time, or careful construction, in order to be tractable. POMDPs can be solved through reinforcement learning, which generally requires a tremendous amount of training to find a solution; otherwise programmers must provide hand-written models designed to be tractable, which can be solved directly, as Hoey et al. do with the SymbolicPerseus software. So using POMDPs is difficult in that it requires either a great deal of time and resources or very careful intervention by the designers for a particular domain.
POMDPs are useful for modeling actors trying to achieve goals, but it's not obvious what contributions they can make to machine understanding of language. In this class we are trying to solve the problem of grounding words in the world, and POMDPs offer a way of "understanding" the world in a fairly realistic manner. We might plug some existing language-interpreting system into a POMDP for acting, by using our language system to extract "observations" from spoken sentences describing the world. We might also use the level of belief in the system to ask for clarifications (additional observations) about the world when an actor appears to be confused. So while POMDPs aren't necessarily the key to unlocking the meaning of language, they could well be a useful way to act rationally based on linguistic input.
I don't see anything in particular that makes driving a POMDP agent more difficult using language, outside of the intractability of many reasonably sized POMDP problems. It does allow the planner to capture a lot of the ambiguity inherent in language and in the groundings of specific NPs -- the notion of partial observability nicely models uncertainty in perception. Perhaps the larger problem -- even more problematic than their unsolvability -- is that it's not clear where the POMDP would come from or how it would be constructed. Converting from a language plan to a POMDP would involve modeling the relevant states, actions, observations, and the observation function! You might have some success by predefining a (limited) domain for the POMDP, or by evading solving it directly: either through reinforcement learning or by approximating the solution as in Hoey et al. But it seems like the space of states for most realistic domains would be far too large to perform value iteration on, and would probably take too long for reinforcement learning as well.
I think you want your POMDP to work in a restricted domain; as you said, the states for an unrestricted domain would be too numerous. In a (very) restricted domain, I think, a POMDP would deal with a reasonable number of states. For example, developing a POMDP model that works in the simulated world in which Chen and Mooney tested their system would be feasible.
Constructing the reward function is tricky. In order for a POMDP model to learn how to act, it should map language inputs to rewards, but language inputs don't correspond to predefined actions. For example, given an instruction "go to X," the model should realize that reaching position X would result in some positive reward. On the other hand, given another instruction, "stay away from X," it should understand that going to position X would result in some negative reward. Mapping words to real values is essential to building such a POMDP model.
In an unrestricted domain, this kind of mapping would be very difficult. In a restricted domain where the model deals with a limited number of objects and a limited vocabulary, it might be able to learn how to act accordingly. Mapping words to objects in the real world seems doable. Given simple sentences, it may be able to learn how to parse them into something that later gets mapped to real values, or it could learn to map sentences directly to real values.
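A toy sketch of that word-to-reward mapping, assuming (hypothetically) that the instruction has already been parsed into a verb phrase and a landmark position:

```python
import numpy as np

def reward_from_instruction(verb_phrase, position, target, radius=1.0):
    """Map a parsed instruction to a scalar reward for the agent's current position.

    "go to"          -> positive reward for being within `radius` of the landmark
    "stay away from" -> negative reward for being within `radius` of the landmark
    """
    near = np.linalg.norm(np.asarray(position) - np.asarray(target)) <= radius
    if verb_phrase == "go to":
        return 1.0 if near else 0.0
    if verb_phrase == "stay away from":
        return -1.0 if near else 0.0
    return 0.0  # unrecognized instruction: no reward signal
```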
One question: on page 102 of L.P. Kaelbling et al., is the current state, s_t, at the bottom of the page the t-th-to-last state? If not, I am not sure whether I understand MDPs.