Friday, September 13, 2013

Tuesday, September 17: Grounded Semantics

This week we will read a paper from cognitive science describing the connection between spatial language and spatial cognition. Specifically, they study what geometric and perceptual features people appear to use to map between spatial language and objects in the external world.
For the assignment, let's try a different format to encourage more discussion.
  • By Sunday night at 5pm, please post a comment on the blog of about 200 words about any of the things we have discussed so far.  You might describe a possible project, ask a question and present some possible answers, or compare and contrast ideas in what we've read so far and suggest areas you'd like to investigate more closely.
  • By Monday night at 5pm, please reply to at least one other comment.  Give them feedback about their ideas, try to answer their question, or expand on a point that you agree with.  I'm looking for about 200 words total; it could be spread across several different comments.

47 comments:

  1. So far, what we have discussed are theories about how to represent the semantics of words, phrases, and sentences. I am actually more interested in how to apply them in the robotic world and how to implement them for a robot to use. For example, the Heim and Kratzer approach represents the meaning (or denotation) of words, phrases, and sentences with functions. The basic logic here is clear after we discussed it in class. But if we want to use this approach to implement a robot that can actually understand the world in some respects, and is able to communicate with people in an interactive way, can you design a detailed implementation of the Heim and Kratzer approach? Also, we will need a predefined dictionary or discourse. For this part, can you come up with any ideas about how we can generate such a dictionary at the very beginning? Or is there any learning method that can make our robot more intelligent and able to learn new knowledge?
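
    To make this concrete, here is a minimal Python sketch of the Heim and Kratzer idea of denotations as functions combined by functional application. The tiny "world", the lexicon entries, and the rule names are invented for illustration; this is a sketch, not a proposed implementation:

    # A toy Heim & Kratzer-style fragment: word denotations are functions,
    # and phrases are built up by combining those functions.

    # The "world": entities with perceptual properties (a stand-in for perception).
    entities = {
        "block1":   {"red", "block"},
        "block2":   {"blue", "block"},
        "pyramid1": {"red", "pyramid"},
    }

    # Denotations of nouns/adjectives: functions from entities to True/False.
    lexicon = {
        "block":   lambda x: "block" in entities[x],
        "pyramid": lambda x: "pyramid" in entities[x],
        "red":     lambda x: "red" in entities[x],
        "blue":    lambda x: "blue" in entities[x],
    }

    def modify(adj, noun):
        # Predicate Modification: intersect an adjective with a noun.
        return lambda x: adj(x) and noun(x)

    def the(pred):
        # "the": return the unique entity satisfying the predicate.
        matches = [x for x in entities if pred(x)]
        if len(matches) != 1:
            raise ValueError("presupposition failure: 'the' needs exactly one match")
        return matches[0]

    # "the red block" denotes block1 in this world.
    print(the(modify(lexicon["red"], lexicon["block"])))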

    Replies
    1. For real-world objects and real-world applications for robots, the dictionary needs to be directly connected to the robot's perception system. Therefore, matching images and their 3D shapes (or structures) with object names can be done beforehand, before running the application. This part is called 'semantic mapping'. The rest of the information about the world can also be created through dialogues between robots and human users. An object's relative location and its affordances can be taught during conversation. Nowadays, researchers use crowd-sourcing to gather data from anonymous human users. Amazon Mechanical Turk or CrowdFlower can help us build the semantic map of a given world and build semantic relationships using human dialogues.

  2. Both Winograd's and Heim and Kratzer's approaches to language representation faced a major problem in scalability. They both allowed for accurate definitions of a small subset of the world, and were even capable of defining new things in terms of old definitions, but failed when encountering something not definable in terms of the current world. I was curious as to what work, if any, has been done to tackle this issue of scalability, and what kinds of solutions are possible.
    In a previous class, Professor Tellex mentioned a possibility of using Google Image Search to allow for the identification of objects, which is definitely an idea I would be interested in exploring more thoroughly. This type of recognition seems promising, both allowing for the recognition of new objects and more accurate command executions, but raises some new issues such as time complexity of building a classifier on the fly, and, as she mentioned in class, classifiers for uncommon or unique objects.
    I also thought about what role learning/teaching could play in this, where you had a robot that would encounter a new object and ask a human (or something that could play the role of teacher) what it was, what it was used for, etc., like a small child would. While this seems like it could be promising and effective, there are major memory usage problems involved with this strategy.
    I’d love to delve deeper into how we are/might tackle this problem of scalability in robotic recognition.

    Replies
    1. If we want robots to ask questions about objects and gather enough information to be used in object recognition, we first need to know how we ourselves understand objects. David Marr's approach in the 1980s was that we use 3D shape structures in our mental model to understand different kinds of objects, and that we can build many different models of objects from variations of these 3D shape structures. Computer vision scientists took a different approach and used 2D features to understand objects, relying on catalogues of appearances of objects from different viewpoints. This approach is similar to what Google uses in its image search. We seem to use both to understand what an object is.

    2. Hi Miles --

      This paper of Percy Liang's describes a statistical approach to learning word meanings to make a system that can answer questions about a geographic database:
      http://cs.stanford.edu/~pliang/papers/dcs-acl2011.pdf

      His more recent work generalizes this to FreeBase, a structured database about a wide variety of topics.

    3. While the internet is clearly an enormous source of image data, I'm not sure that the results of an image search would necessarily be useful for a classifier, for reasons fundamental to what image search and training data should be. Ideally, training data would be drawn from the same distribution as the inputs. Image search results, I think, "should" show a platonic version of the query. If you try a search for "apple", this is largely what you get: views of ideal apples from the side. (Though this also demonstrates a lesser issue: translation between visual and linguistic representation can be lossy: apple (fruit) / apple (company).) For a robust visual model you would want inputs of the sort a robot would get in the wild: different lighting conditions, all different angles, obstructions, and so on. These, in turn, would I think make bad search results.

  3. (1) The methods of language representation we have visited so far are purely (non-empirical) symbolic systems – SHRDLU, and a machine of the Frege/Heim/Kratzer breed, relying on the functional application schema, do not incorporate (novel) empirical knowledge to inform their models of understanding. Given that the focus of this course is on Language as applied to the domain of *Robotics*, I'm curious about the ways in which one might leverage the systems for language understanding we have discussed so far (or design new ones) whose functionality includes processing empirical data (e.g. a system that includes sensors & classifiers, as well as a meta-language for incorporating novel information from the sensors into the agent's language model).

    (2) My second question is targeted at the overarching goal of language and robotics. It seems apparent with SHRDLU that, with some work, we ought to be able to design and build a robot capable of simulating language understanding to the extent required for a particular domain of applications (i.e. manufacturing, a kitchen aid, a moon rover, etc.). These 'domain-specific' systems possess language capabilities that fall short when we breach the bounds of the domain the robot is expected to occupy (and in rare other cases, too). Oftentimes this breach is unsatisfactory to the user(s) of the bot, and can severely limit the fluidity of any interaction with the robot. In dire cases, the robot might be completely incapable of accomplishing its task. Ultimately, though, I'm wondering if the lack of flexibility and scalability in these domain-specific models is an obstacle at all – is our ultimate goal to achieve a system that is capable of genuine, human-level language understanding (i.e. one that *never* experiences a 'breach')? Or would we be satisfied with improving our domain-specific models? Do we care if a robot is 'simulating' language understanding, so long as it is capable of accomplishing the tasks necessary for its domain?

  4. So far, it seems that the ideas we have read all have something to do with a human-specified set of grammar rules which can later be used to process human language input (e.g., Terry Winograd writes a collection of programs in SHRDLU's parsing system; I. Heim and A. Kratzer propose a set of rules to project sentences to truth values). However, according to language acquisition (especially first language acquisition) theories, human children are able to develop a certain extent of grammar from a finite number of sentences that they encounter. Since programs are capable of processing a much larger amount of input data than human children, I am interested in what the result would be if we let a robot try to learn grammar rules directly from natural language input, e.g., the correctness and soundness of the learned grammar, whether the learned grammar would be consistent, whether the learned grammar would be more robust than the human-proposed rules, etc.

    Replies
    1. I agree with and would like to expand on the last point you made. Inherently, language and its lexicon are not a closed set, so it's impossible to create a complete dictionary of all the words in the English language. Even creating a sufficiently large set of word meanings would take an extraordinary number of man-hours. However, as you said, using training data and principles from natural language processing could generate satisfactory results. NLP already has tools to determine grammatical correctness and model topics. Although the initial training would take a lot of time, the robot would only need to go through the corpus once, and therefore in the end the training time is realistic.

    2. It would be very exciting if robots could learn grammar directly from natural language and behave like real humans. I am very curious about the process by which robots can learn from training and everyday work. If we get rid of unimportant data and keep only core words, can robots learn better? Human infants don't need to know "buy things" as a pair; they can just say "buy". How can we simplify the process?

  5. Up to this point we have seen a couple of different approaches that have attempted to tackle the symbol grounding problem. While these approaches may differ in their application, the theory behind them is largely similar. The work of both Winograd and Frege/Heim/Kratzer took advantage of the compositional nature of sentences to apply a logical framework to them. In particular, the focus seems to be on using lambda calculus to traverse a parse tree and generate a composite meaning for a sentence that is based in grounded symbols and predicates. From what we have seen this seems like a relatively fragile system that is unable to handle symbols that fall outside of its domain. I believe that the introduction of statistical techniques into this field has been an attempt to remedy this fragility.

    The introduction of probability and statistics addresses two core issues: 1) learning from "experience" and 2) modeling uncertainty. The first goal is relatively straightforward: if we want a machine that can learn from its environment, it needs to be able to recognize patterns and be trained on previous examples. The second goal might be less obvious, but is much more far-reaching. Unless we are dealing with a simulation, uncertainty will permeate whichever environment the robot finds itself in (e.g., disambiguation of word meaning, event likelihood, or interpreting human actions). In order for our robot to make educated decisions in such an environment, it needs to be able to model this uncertainty to some degree.
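
    As a toy illustration of the second point, here is a small Python sketch of grounding under uncertainty; the candidate referents, probabilities, and threshold are all invented:

    # Instead of a single grounded symbol, each candidate referent for
    # "the red block" gets a probability, and the robot either acts on the
    # most likely one or asks a clarifying question.
    candidates = {"block1": 0.70, "block3": 0.25, "pyramid1": 0.05}

    best, p_best = max(candidates.items(), key=lambda kv: kv[1])

    CONFIDENCE_THRESHOLD = 0.6  # invented threshold for acting vs. asking
    if p_best >= CONFIDENCE_THRESHOLD:
        print(f"pick_up({best})")
    else:
        print("Which one do you mean?")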

    Replies
    1. You propose that including stats and learning would greatly help the model, and I think unquestionably it would. However, I think it's not nearly that simple. First off, it's important to frame the problem correctly. If we are strictly talking about the SHRDLU program, I think re-wiring it to use statistics would accomplish a fair bit (fixing the case where we asked it, "What are you holding [now]?", where it broke on the inclusion of "now"), but there are other things it wouldn't help to solve (not learning, from the pyramid placement attempt, that pyramids cannot support other pyramids). However, I feel pretty strongly that the SHRDLU program 'cheats' in a very important way, in that it is given omniscient, pre-determined information about the world. A counter-argument is that this reduces the problem specifically to natural language processing without having to worry about learning, knowledge representation, and perhaps other issues. On the other hand, these issues seem so critical to the language processing problem (how do you understand the most likely meaning of a sentence if you don't know what is plausible in the world?) that it seems misguided to push them aside.

  6. In the previous topics we were focusing on how a robot understands and represents word/sentence meanings. Although it would be ideal to design a perfect algorithm so that the robot never makes mistakes, it is inevitable that some mistakes happen during a robot's learning process. But it seems neither of these papers mentions how to correct mistakes, and I am wondering if there are ways to do so and to update the robot's information in time. For example, if a robot picks up a red box though the user is asking it to pick up a blue one, with a good mistake-correction algorithm the user can tell the robot it should pick up the blue one and not the red one, and it will not make the mistake in the future because it updates its knowledge. (This seems similar to the way SHRDLU learns a new word. Obviously the robot will keep making mistakes if we do not correct them.)

    Replies
    1. Comments about the reading paper:
      The paper indicates that the spatial representations linked to object names must provide enough different shape descriptions to distinguish all kinds of objects categorized linguistically on the basis of shape. Sometimes the shape of an object is critical to what the object is called. I agree that the importance of shape in object naming is particularly dramatic. For example, 'toy bear' may not really belong to the category 'bear'. In this case, I think it's possible to treat 'toy' as an adjective, the way 'big' is treated in the paper's example. Or we can rephrase it as 'bear toy', which is clearly categorized as a toy.

      And actually, I supposed the perceptual ground is the same as the reference object mentioned in the paper, and even the same in the linguistic representation of place. But the paper indicates that they do not completely overlap, giving an example: in 'The cat is on the mat', the mat is both the perceptual ground and the reference object, but in 'The cat is near the mat', the mat is the reference object but not the perceptual ground. I don't understand how 'on' and 'near' make the difference.

  7. The papers we have looked at so far model language as procedures or functions that are evaluated by an agent to produce some change in the world. This approach reminds me of program interpreters and how we model programming languages. It might be interesting to consider how we might make this approach more modern by treating English language commands as programs and using the approaches we have for interpreting those.

    Perhaps a robot could have a “core” programming language which is made up of those simple, essential commands such as “move(x,y)” or “moveArmToPosition(x,y,z).” It could then have a higher level language with features like “pick up” and have an environment containing those objects about which it is aware. Commands in this higher level language would be desugared into more basic commands. The “surface syntax” would be English commands. These commands would be probabilistically parsed into the high level language.

    This approach could allow us to better separate out the different kinds of learning and understanding that are going on. There's the understanding of what the action words in a command mean which is represented by the mapping from the surface English to the high level language. The problem of mapping symbols to the world is encompassed in the design of the environment. Finally the question of knowing how to perform some action is represented in the desugaring from the high level language to the low level language.

    This thought, of course, doesn't necessarily provide a way to surpass the limitations of these papers but perhaps gives us a more up-to-date and useful model for thinking about them.
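
    As a rough sketch of that layering (the command names, the tiny environment, and the trivial "parser" are all invented; a real system would put a probabilistic parser where the string matching is):

    # Three layers: surface English -> high-level language -> core commands.

    # Core language: the robot's primitive actions.
    def move_arm_to(x, y, z):
        print(f"moveArmToPosition({x},{y},{z})")

    def close_gripper():
        print("closeGripper()")

    def open_gripper():
        print("openGripper()")

    # Environment: objects the robot knows about, with their poses.
    environment = {"red block": (0.2, 0.1, 0.0), "green block": (0.5, 0.3, 0.0)}

    # High-level language, desugared into core commands.
    def pick_up(obj):
        x, y, z = environment[obj]
        move_arm_to(x, y, z)
        close_gripper()

    def put_on(obj, target):
        pick_up(obj)
        x, y, z = environment[target]
        move_arm_to(x, y, z + 0.05)   # place slightly above the target
        open_gripper()

    # "Surface syntax": a trivial stand-in for the probabilistic parse step.
    def interpret(utterance):
        if utterance.startswith("put the ") and " on the " in utterance:
            obj, target = utterance[len("put the "):].split(" on the ")
            put_on(obj, target)

    interpret("put the red block on the green block")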

    Replies
    1. Interpreting natural languages is so much trickier than interpreting programming languages. It is possible to interpret programming languages because they follow a few strict rules. To interpret a programming language command, you just need to follow the rules. On the other hand, natural languages don’t really obey certain rules. There are many exceptions to the rules, and many words, phrases and sentences are ambiguous. I think that treating natural languages as programming languages would not help robots to “understand” natural language commands.

  8. From the previous classes in this course, we have seen that it is possible for humans and machines to have a conversation, share knowledge about a given task, and conduct that task when the context and the number of words they speak are limited. We also saw real- and virtual-world examples such as SHRDLU, IKEA assembly bots, and direction-following quadrotors.
    Given this, for the class project I want to investigate re-implementing some of these examples and bringing them to remote situations where a human user is meant to tele-operate the robot.
    In the teleoperation scenario, I would like to pursue increasing the robot's autonomy as much as possible to ease the cognitive load of the operator. Instead of operating the robot for each given action, I aim for users to give instructions in natural language. I am also interested in building semantic world models to help the robot understand its surroundings and better understand human instructions.
    In re-implementing the example, I would like to investigate what components are missing and what the possible ways are to make given tasks more scalable so that they can be adapted to real-world situations with ease.
    On the other hand, I would also like to investigate what other scenarios are possible given that we have to limit the total number of words spoken and the context has to be set.

    Replies
    1. So in the process of a robot completing a task, there is a high probability that the robot will fail to execute some action. With the help of a human teleoperating the robot when it fails to execute an action, the robot can complete a given task better. In addition, the robot can resume autonomously executing its remaining actions after the human helps it complete the failed action. For example, in a drink-serving scenario, the robot aims to serve a bottle of coke from the dining_table to the human. When it fails in navigating to the dining_table, the user teleoperates it to the dining_table, and the robot then segments the table, recognizes the coke, and grasps it. In fact, Professor Manuela Veloso from CMU proposed an idea called "symbiotic autonomy" for robotics, in which a human and a robot complete a task together. There is a human-centered task planner inside the robot. Maybe this is the cutting edge of human-robot interaction research.

    2. I agree with Zhiqiang -- I think it's interesting to think of robotic systems where failure is anticipated and human correction accounted for in recovery. A good analog might be efforts to improve error messages in programming languages, which make a big difference IMO.

  9. The two approaches previously discussed were both trying to define a general way of representing the meaning of all words and word types with one unified theory. In contrast, Landau and Jackendoff explicitly divide the lexicon into two ideas, 'what' and 'where,' and examine them separately. While it might not be realistic to neatly divide all words into categories such as 'what,' 'where', and 'movement,' it helps to acknowledge that different word types will have different kinds of word meanings. It might help a robot understand if it can look in a smaller set for a word's meaning.

    I mentioned something along these lines in my last post, but like others I'm wondering what the goal is for robotic language. Are we trying to have a robot 'understand' language in a philosophical way like humans do, or trying to have the robot accurately interpret language in order to function appropriately? In my opinion, I don't see trying to replicate human thought processes as a desirable, even if possible, goal. The theories we've studied are trying to model how humans represent meaning cognitively. I think that is a very interesting subject and might provide ideas for systems to use for robots, but not the solution. Instead, I see the process as something more like creating a constructed language, for example Esperanto. People studied the way languages work, what seemed logical and simple and 'correct,' and invented a language based on that. For robots, I see it as a similar process: use the knowledge gained from studying human cognition to design a system of cognition for robots.

    Replies
    1. I think this is a good point; just having a very human-like knowledge representation doesn't make a robot useful and is probably more likely to make the problem more difficult. We are likely to end up representing knowledge in some statistical manner anyway, and from my experience with cog sci courses this seems to be a pretty popular model for how human beings represent linguistic understanding. It is of course still useful to talk about human beings, since they are the most functional language-understanding system we have.

    2. I think people are trying to mimic the human’s representation of word meanings because nobody really knows how to represent word meanings differently. If people knew a different way of effectively representing word meanings, we would already have some remarkable AI systems. I think it doesn't really matter how robots represent word meanings as long as they work. I don’t know if constructing a new way to represent word meanings is easier than simulating what we do. I would take an easier approach, if I had a clue.

    3. I agree insofar as replicating human representations for their own sake is silly (from a practical robotics perspective, maybe not from a cogsci perspective). But I think that there's a distinction to be made between things which are at least as capable as human cognition and things which are exactly as capable. We have neither at the moment, so I think human cognition is an appropriate implementation to study. As you point out, this probably should serve as inspiration rather than specification.

    4. I agree that the goal for robotic language should be building robots that "think rationally" rather than "think humanly", and I think the idea of creating a constructed language is a very good point. However, if we would like humans to be able to communicate effortlessly with robots, eventually the communication would have to be made in natural language (in most cases, English). Therefore, after constructing an internal representation system that is suitable for robots, it might still be worth considering how to project it onto human language.

  10. In the course we have looked at two methods of representing natural language: SHRDLU, which combines syntactic analysis with semantic analysis (with some heuristics), and the Heim and Kratzer approach, which looks to convert natural language to truth values. Both approaches try to break a large NLP problem into subsets/scenarios that can be solved reasonably, and hence face scaling issues (when targeting a real-world problem). The SHRDLU approach is far-reaching from the point of view of robotics because, as a platform, the simulated system had the ability to evaluate and manipulate its environment, which is required in a real-world robot. I would be interested in a machine learning perspective: can we generate a value function over which we force the robot to learn language? Not the sort of Chinese learning problem, but one where active feedback is given to the robot to learn, along with heuristics. What level of heuristics and environment would be needed would still have to be decided.
    @Yujie Wan: If we throw a big wall of text at a program, how does it know what it has to replicate? Grammar rules in most languages have easy heuristics or a predictable model. If we throw that out, we would be trying to solve a larger problem, and the feature space can be huge. At least with heuristics we can classify a word as a feature type (verb or noun or adjective). Plus there can be context-related semantic problems. Maybe I am wrong here, but heuristics seem to ease the problem. I would also like a pure machine-learning-based formulation for language understanding, but I think the combinations possible in natural language can be far more than those in object recognition (where we named every object and its activities and their different orders). I would like to know more about this too.

    Replies
    1. From what I've understood, it seems like you're interested in using reinforcement learning to teach the robot semantic analysis. This is an interesting idea, and I believe it fits well with a model of how humans learn language. If a young child makes a grammatical mistake, they are corrected (or even punished) for doing so. Reinforcement learning would allow you to simulate this process, the robot would attempt to parse some text and would be punished/rewarded for its efforts. If this is what you are interested in pursuing it may be worthwhile to speak to Michael Littman on the matter.

      On the other hand, this doesn't exactly explain how young children learn semantic analysis. I would argue that at an extremely young age children might not be able to understand the connection between the reward/punishment and their understanding of language.
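
      As a toy sketch of that reward-driven word learning (the words, actions, and teacher signal are invented, and this is only a bandit-style update rather than a full RL formulation):

      import random

      # Learn which action each word refers to purely from reward:
      # +1 when the chosen action matches the teacher's intent, -1 otherwise.
      words = ["grab", "drop"]
      actions = ["close_gripper", "open_gripper"]
      correct = {"grab": "close_gripper", "drop": "open_gripper"}  # hidden from the learner

      # Q[word][action]: estimated value of taking `action` on hearing `word`.
      Q = {w: {a: 0.0 for a in actions} for w in words}
      alpha, epsilon = 0.5, 0.2

      for step in range(200):
          w = random.choice(words)
          # epsilon-greedy: mostly exploit the current estimate, sometimes explore.
          if random.random() < epsilon:
              a = random.choice(actions)
          else:
              a = max(Q[w], key=Q[w].get)
          reward = 1.0 if a == correct[w] else -1.0   # "teacher" feedback
          Q[w][a] += alpha * (reward - Q[w][a])        # simple value update

      print({w: max(Q[w], key=Q[w].get) for w in words})  # learned word -> action map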

  11. In Thursday's lecture, we discussed semantics and language meaning. The Fregean program helps us form structures to identify the truth-conditions of sentence components. In order to do that, the Fregean program puts forward ways to represent sentence components using functions. It divides the sentence into different parts and tries to form uniform functions for various words if they are of the same kind. Truth-conditional values that we get from context can be applied within the Fregean program. When the final outcome is not a truth-condition, the sentence is uninterpretable. In a computational linguistics class, we learned how to analyze sentences into components using dictionaries that have been created beforehand. However, it can take an hour or so to analyze a file with several thousand words. This might be too slow for building a real robot. How is the Fregean program used in designing robots? Are there any efficient algorithms? Since the Fregean program finds the truth-conditions based on context, we need to build context for robots. How do we do that? And what if the robot comes across some words that are not in the dictionary?

    Replies
    1. It does take quite a long time to create a dictionary from training data. However, if we consider the situations where a robot is to be used, we may find that we are actually not likely to give the robot very long commands. For a service robot, we may only need short commands like "pick up the book", "wash the dishes", or some other verb phrase. Maybe that is not the only case, but hopefully a robot can process such commands quite quickly.
      However, when considering processing time, Siri is one example where a single command may take some time to finish. Although it will not take long to "understand" the question, I think most of the time is spent searching for the "correct" answer. Also, in the video we saw in the first class, although the robot could "understand" the short command quickly, I think it spent a long time on object recognition and computing the desired actions.

    2. We can use data that we already know. And what should we do if we use robots with less structured languages like Chinese? What is more, Chinese contains a lot of words that have the same or similar pronunciations. For instance, "wine" and "nine" have the same pronunciation. How can we tell whether the user wants "nine bottles" or "wine bottles"?

    3. I asked something similar in my post, where I question the connection between interpreted "meanings" and the real world. I'm under the impression that robots must have additional pre-programmed modules for such interactions. Then sentences which are interpreted into truth conditions can be combined by the relevant modules to mean something useful. In other words, something would need to be layered on top of the Fregean program. For example, I'd guess that if you ask a robot what color a block is, a side module could cycle through colors, asking itself whether the block is that color. Then it may choose an answer depending on additional statements.

      This potentially means that in order to build a truly versatile robot, a lot of components need to be brought together for anything that the robot may do. I wonder whether we may one day create a true artificial intelligence which learns things completely outside of its programmed scope. So far, I've heard nothing of the sort.
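
      A tiny Python sketch of that "side module" idea, with an invented world model and color predicates layered on top of truth-conditional questions:

      # To answer "What color is block1?", cycle through color predicates and
      # keep the one whose truth condition holds in the world model.
      world = {"block1": {"color": "red"}, "block2": {"color": "blue"}}

      color_predicates = {
          c: (lambda x, c=c: world[x]["color"] == c)
          for c in ["red", "blue", "green"]
      }

      def what_color(x):
          for color, holds in color_predicates.items():
              if holds(x):        # ask the yes/no question "x is <color>"
                  return color
          return "unknown"

      print(what_color("block1"))  # -> "red"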

  12. I have a question: what are we supposed to learn from the previous readings and discussions? I don’t see how they can help us develop intelligent systems that could “understand” natural languages.

    SHRDLU was a nice attempt in the past but I don’t see how SHRDLU can be extended in a meaningful way that helps construct smarter robots. It can only execute commands that it has already memorized. This is the very opposite of the statistical approaches that we are interested in.

    Mapping English sentences to lambda calculus is nice. But what do we do with it? It doesn’t really provide us much information. And I don’t see how lambda representation of English sentences could empower robots to “understand” English sentences.

    Are we supposed to learn what not to do when we are building intelligent agents? Or are we supposed to feel how hard it is to build such agents? Or what am I missing?

    Replies
    1. I think the robot can use a dictionary to understand natural language, and we can use machine learning algorithms to help it. We can use mappings to help the robot find what it is instructed to do and the corresponding responses. It is like the reverse of the process of developing games: developing games and other applications requires us to analyse data and represent users' instructions as vivid pictures or movements. Instructing robots means getting instructions from users, analysing them, and comparing them with known data sets.

  13. Hector Levesque gave a speech at IJCAI, later written up as "On our best behavior" [1]. In the paper he argues that a lot of AI is built on cheap tricks (aka heuristics). In short, things that get right answers for the wrong reasons. He gives an example prompt:

    "Could a crocodile run a steeplechase?"
    [ ] yes
    [ ] no

    This can be answered in several ways. The human approach would be something approximating a crocodile simulation. I know what a crocodile is capable of doing, approximately, and it's quite easy for me to determine that a crocodile couldn't clear hurdles.

    A cheap trick relies on a closed world assumption: if I can't find an example of a crocodile running a steeplechase, they can't. Note that this is something we saw in SHRDLU too: hypothetical questions about the world were answered only when there was an example of the hypothetical in the block world.

    Levesque proposes a better type of intelligence test, which he names a Winograd-schema test. There are many examples in the paper, which I won't include here. The essence is to exercise some expertise in disambiguating a question -- expertise in things like human behavior, material properties, appearance and so on.

    The tricky thing is that in many domains -- vision and language come especially to mind -- you need to know things to learn things. This makes things very difficult to get going, as there's a circular dependency between understanding and itself.

    I'd like to look at bootstrapping linguistic systems, where the primary objective is to provide some minimal (not necessarily small) understanding and use that to expand understanding via whatever means might be effective. Levesque does warn: "I do not think that we will ever be able to build a small computer program, give it a camera and a microphone or put on the web, and expect it to acquire what it needs all by itself." I agree. But I'd like to explore what it might take to do so theoretically, and try to make exploratory in-roads on understanding primarily with learned information.

    [1] http://www.cs.toronto.edu/~hector/Papers/ijcai-13-paper.pdf

    Replies
    1. I'd like to highlight something further. Many of you seem to express the opinion that true language understanding isn't worthwhile -- domain-specific or functional/behavioral simulation is enough. I disagree. Granted, domain-specific language is useful, is relatively practical, and has yielded impressive results. But I think that true language understanding is not only a much more interesting problem, but a more promising one to solve. Useful behavior in individual domains, while hardly a solved problem generally, has been demonstrated. It's the ability to work in any domain that makes human intelligence uniquely valuable. Finding a way to achieve similar flexibility computationally would not only solve all instances of single-domain problems, but would additionally solve the problem of choosing, switching, and learning appropriate domains. The implications of that are much more than the sum of every useful single-domain application.

    2. I'll play devil's advocate here. I acknowledge that domain flexibility is a critical and integral feature of human intelligence – I'm just not convinced that robots need this feature in order to fulfill their potential. To me, robotics has the potential to contribute massive amounts of utility to society due to the fact that robots are physically (and computationally) capable of feats that humans are not. They can exist in spaces we cannot (e.g. outer space, the bottom of the ocean), and can execute tasks that are simply not possible for humans. While robots like the Roomba are neat, I'm convinced that the ultimate end-game for robotics does not lie in mechanizing household tasks. So, in thinking about what this end-game might consist of (I'm curious what you think it might be), my question is, do we require genuine language understanding? If, as you suggest, achieving human-levels of language understanding will essentially solve all of the problems of each sub-domain (as opposed to employing more sophisticated 'cheap tricks' to try and accomplish domain-specific tasks), then it seems a no-brainer – let's solve all the problems with a single solution. However, I'm not convinced such a solution is as tractable as expanding the 'cheap trick' methodology, and it appears as though we might be capable of reaching the level of utility we want from robots that simply simulate language understanding with respect to their domain. I'd also like to add that I think there will certainly be leaps and bounds made within various realms of robotics (e.g. Learning/Vision/Knowledge Representation) that will contribute across all domains of robotics, I'm just not convinced that our goal ought to be achieving genuine language understanding, given the inherent intractability of the problem, and considering that we might not *need* those levels of language understanding to achieve what it is that robots ought to achieve (which is obviously up for debate). Let's discuss more about this tomorrow in class! This is a really interesting area to chat about..

    3. Also - I really like the sound of Levesque's work. I looked into the Winograd-Schema test and it's quite neat. Do you think it is a sufficient test for determining genuine language understanding? Levesque provides a few examples in his paper "The Winograd Schema Challenge" (http://commonsensereasoning.org/2011/papers/Levesque.pdf)

      (1) The trophy would not fit in the brown suitcase because it
      was too big. What was too big?
      Answer 0: the trophy
      Answer 1: the suitcase

      (2) Joan made sure to thank Susan for all the help she had
      given. Who had given the help?
      Answer 0: Joan
      Answer 1: Susan

      One of his goals for the test was to design it in such a way that "having full access to a large corpus of English text might not help much", which is an interesting route to take. If this test isn't sufficient, could we offer a better candidate?

    4. No doubt everyone would like a robot to have true understanding of its environment and language, but the methods we have to deal with these are goal-oriented right now. Consider computer vision: we have only now begun to look at the problem of inference in vision (like activity recognition). And we understand that there can be a huge corpus of activities that, like grammar, can be composed. But we look to recognize a small set of activities, not all of them together. That is how we start. Humans themselves can solve Winograd's test only because of heuristics. We are not evaluating the sentence based on preset rules we know (like a regular councilman would not be jealous).

  14. How can a robot understand users' commands and tasks and complete them in everyday life? For example, "put the red block on the green block". In SHRDLU's approach, this command is of the (#PUT X Y) type and it should be grounded into specific symbols. Assume that there are already some assertions in SHRDLU: (#BLOCK B1). (#COLOR B1 RED). (#BLOCK B2). (#COLOR B2 GREEN). So this command is grounded into (#PUT B1 B2) and is executed by its executor. In Heim and Kratzer's approach, the parse tree is (ROOT
    (S
    (VP (VB put)
    (NP (DT the) (JJ red) (NN block))
    (PP (IN on)
    (NP (DT the) (JJ green) (NN block))))))



    and there should be corresponding lexicon entries for the words "the", "red", "on", and semantic rules to get the meaning of the command. These lexicon entries and semantic rules are all manually written. So as there are more and more sentence patterns and words, it will take a lot of labor to write down the corresponding rules.
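
    As a toy sketch of the SHRDLU-style grounding step above (the assertion database and the helper function are invented, not Winograd's actual code):

    # Ground (#PUT X Y) against a small set of assertions.
    assertions = {("BLOCK", "B1"), ("COLOR", "B1", "RED"),
                  ("BLOCK", "B2"), ("COLOR", "B2", "GREEN")}

    def find(shape, color):
        # Return the symbol satisfying both the shape and the color assertion.
        for sym in {a[1] for a in assertions if a[0] == shape}:
            if ("COLOR", sym, color) in assertions:
                return sym
        return None

    x = find("BLOCK", "RED")     # -> "B1"
    y = find("BLOCK", "GREEN")   # -> "B2"
    print(("#PUT", x, y))        # grounded command: ('#PUT', 'B1', 'B2')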


    If the system is combined with a reasoning part that can make deductions about the current environment, as in SHRDLU, much more handcrafted knowledge will be needed. For the command mentioned above, if the robot already has a blue block in its hand and there is a pyramid on the green block, the system must make inferences about what the robot should do to complete the command. The robot should first put down the blue block in its hand, then pick up the pyramid from the green block and put it on the table. At last, the robot can put the red block on the green block. So how can the system make deductions like this? To achieve this, the system should have knowledge about its action model: it should know the preconditions and effects of each action. The robot couldn't pick up an object with some object already in its hand, so the precondition of the "pick up" action is that its hand is empty, and the effect of the action is that its hand holds the object it picked up. With more and more actions incorporated into the robot, it is inevitable that this knowledge be written manually into the system.
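
    A minimal sketch of such an action model in Python; the state variables are invented, and the ordering of the steps is written by hand here rather than found by a planner:

    # Each action checks its preconditions and applies its effects.
    state = {"holding": "blue_block", "on_green_block": "pyramid"}

    def put_down_held(s):
        # precondition: holding something; effect: the hand is empty
        assert s["holding"] is not None
        s["holding"] = None

    def clear_green_block(s):
        # precondition: hand empty and a pyramid on the green block;
        # effect: the pyramid is moved to the table
        assert s["holding"] is None and s["on_green_block"] == "pyramid"
        s["on_green_block"] = None

    def put_red_on_green(s):
        # preconditions: hand empty and the green block is clear
        # (picking up the red block is folded into this step for brevity)
        assert s["holding"] is None and s["on_green_block"] is None
        s["on_green_block"] = "red_block"

    # The deduced order: free the hand, clear the target, then stack.
    for step in (put_down_held, clear_green_block, put_red_on_green):
        step(state)
    print(state)   # {'holding': None, 'on_green_block': 'red_block'}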


    Furthermore, let's consider a more complex situation. In a household environment, the users may ask a service robot to complete various everyday activities such as setting a table, preparing meals, or serving drinks. But such tasks are extremely knowledge-intensive, so the designer of the robot cannot program them all beforehand. For example, when the human tells the robot to prepare a meal, how does the robot know how to prepare a meal? This knowledge must either be handcrafted or obtained from somewhere else.


    In general, there are still many problems to solve before we can implement a real robot that can serve people in a household environment. But we can approach this goal by re-implementing a simple AI application such as the blocks world using state-of-the-art techniques and methods.

    Replies
    1. Shouldn't we be able to develop robots that acquire this kind of knowledge through reinforcement learning or classical planning? Reinforcement learning, of course, might take a very long time to learn a complex task and still often requires a good amount of human intervention to provide the actions and rewards the robot takes and receives. Alternatively, in a small domain we could avoid having to wait for an RL process to take place by manually entering the relationships between various actions and having the robot run a planner. It seems like we already have these useful tools to help advance beyond the blocks world.

    2. The main focus of SHRDLU was to enable a robot to communicate with a human and carry out the things that the person told it to do. This approach is a little different from RL or other planning approaches, where the goal is already pre-programmed and the robot's job is only to find what actions to take to achieve the given goal. Instead of trying many different actions, the robot may ask questions about the ambiguity it is facing or ask for help. Therefore, making a robot do house chores has to do with enabling robots to understand their surroundings, learn specific motor skills, understand people's commands or questions, and reason about these commands and questions to choose what actions to take or what questions to ask back.

  15. The real problem with the Heim and Kratzer reading is that it's all about context-free grammar and the title of the class is about grounded language. Grounding implies context awareness, not context-freedom. If grounding was as simple as context-free language plus context, wouldn't there be some better examples of it available by now? SHRDLU has some context, but ultimately it is still just manipulating symbols that, to the computer, are void of further meaning. This doesn't happen with people.

    Every noun I use conjures up a fairly rich set of associations, some of which have to do with a widely-shared sense of the word's meaning, while others are more private. For lots of nouns there is personal experience behind them, along with the definitions in terms of other words. A lemon isn't just a "yellow citrus fruit" but is a fruit I've picked, tasted, observed, spit out, and laughed over. Abstract nouns like "justice" conjure up meanings in terms of disputes I've had with my sister, as well as stories of what happens in courtrooms, as well as a freshman philosophy class spent dissecting Rawls. This is the grounding behind the language I use. Robots, because they exist in the real world, promise the possibility of a similar kind of grounding, but I doubt we'll get there via the same roads that brought us context-free grammars.

    Replies
    1. If you wanted to set some context for word meanings, the most straightforward approach would be to use a large set of dictionaries and thesauruses to map words to other words or phrases. Then furthermore, you could create statistical relationships based on how closely they are set in large bodies of writing. But this sort of creates an intractable set of problems for common words. Words with too many meanings will be too connected and actually make it impossible for our agents to set some specific definition. For example, the word 'set'.

      I think I set some sort of record here, maybe the sun just needs to set on my comment, or perhaps I just set in motion a terrible, terrible comment thread.
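
      A toy sketch of the statistical-relationship idea above, counting which words co-occur in the same sentence of a tiny invented corpus:

      from collections import Counter
      from itertools import combinations

      corpus = ["the cat sat on the mat",
                "the cat chased the mouse",
                "the mouse ate the cheese"]

      # Count how often each pair of words appears in the same sentence;
      # pairs that co-occur often are treated as related.
      pair_counts = Counter()
      for sentence in corpus:
          words = set(sentence.split())
          for a, b in combinations(sorted(words), 2):
              pair_counts[(a, b)] += 1

      print(pair_counts.most_common(3))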

    2. Please set aside the set of objections that set up dictionaries and thesauri as final arbiters of meaning. A dictionary is a set of definitions that set out a common denominator of meaning for people who already speak the language. A thesaurus is a set of sets of synonyms that set... ok, enough of that.

      The point is that neither dictionaries, thesauri, nor (ahem) sets of statistical correlations grounds a symbol in a, well, set of experiences, feelings, or impressions. If you've ever watched a six-year-old, proud of her ability to read, consult a dictionary to find a word meaning and have no idea what the definition meant, you'll know how inadequate they can be to specify a meaning. Merely specifying a dictionary just creates an infinite regress where words are defined only in terms of other words. There has to be something else behind the symbol.

      For SHRDLU, it was the functions that made up the word meanings. Ultimately this wasn't as satisfying as it should be because the functions themselves operate on LISP symbols, which aren't really grounded either. But it did mean that the English words were grounded in LISP symbols, which seems a step forward out of the infinite regress trap.

      I think a lot of the issue is a real lack of clarity about the goal. We say we want a machine to understand a sentence, to find its meaning. But I think these two words don't have very precise definitions, allowing people to think they've satisfied them with things like lambda functions.

    3. I agree; a machine has no set of goals to achieve aside from what it is being asked to do. Machine learning and reinforcement learning are clear-cut setups: we have a goal, which we try to optimize. To make a robot do a task there have to be overall goals, of which speech is just a part. For example, if I ask a robot to pick up a Lego block, the goal is not that the robot understood me; the goal is that the robot finished its task of picking up the block. For this, everything in the goal setup must be defined, including what a block (the word) is, and where it must be found. It makes no sense to explain to the robot what Marxism is to achieve this task, or to make it memorize the whole dictionary; that only complicates the setup. Humans gain knowledge slowly, not all at once.

  16. I'm floating a couple of ideas for final projects. The first is the question of how to provide additional unprompted context to robot commands or questions. Just by observing that an agent is behaving in an undesired manner, users should be able to correct specific actions of the agent. Depending on which chunk of the command the agent is completing, a correction or update would carry a different meaning. A sentence describing picking up a block and putting it on another block requires two actions. If during the first, the user wanted the robot to pick up a different block, he should be able to say, "No, the (other/red/closer) block", and expect the agent to try to remap the meaning. Whereas, if the robot was placing the block, that same sentence would carry a different meaning and thus require different actions.

    My second problem to explore, nascent though it may be, is how to map words to different domains. We said in class that limited language domains are required in order to create tractable understanding algorithms, but what if the system could switch its active domain based on the context of the language? If, for example, you were giving basic robot commands in the block world, but wanted to ask the robot a completely unrelated question, you could give enough context to make that question completely unlikely in the current domain, but statistically very likely in another domain.
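
    A rough sketch of that domain-switching idea, scoring an utterance under invented per-domain unigram vocabularies and picking the most likely domain:

    import math

    domain_vocab = {
        "blocks":  {"block": 0.3, "pick": 0.2, "stack": 0.2, "red": 0.2, "up": 0.1},
        "weather": {"rain": 0.3, "today": 0.25, "cold": 0.25, "outside": 0.2},
    }
    SMOOTHING = 1e-4   # probability assigned to out-of-vocabulary words

    def log_likelihood(utterance, vocab):
        return sum(math.log(vocab.get(w, SMOOTHING)) for w in utterance.split())

    def best_domain(utterance):
        return max(domain_vocab, key=lambda d: log_likelihood(utterance, domain_vocab[d]))

    print(best_domain("pick up the red block"))      # -> "blocks"
    print(best_domain("is it cold outside today"))   # -> "weather"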

    Replies
    1. For your first idea, it seems like SHRDLU already does something very similar to this with its context recognition. The only difference is that since SHRDLU is text-based, it finishes executing or asks for clarification rather than acting continuously as a robot in the real world would. The cases you are describing are similar to the scenario where SHRDLU is told to "Put the red pyramid on the blue block" when there are two blue blocks. SHRDLU would immediately ask for clarification, whereas in the system you describe, it might just pick the closest blue block and then allow for interrupting clarification. This could then be interpreted as if SHRDLU had waited on the action and asked for clarification instead. Additional context could easily be added by the current action (picking up vs. setting down, etc.) in the same way that context was applied to "one" and "it" in SHRDLU. This seems like a very reasonable way to extend SHRDLU's functionality into the real world, and would also be much more helpful in the real world, where the entire world may not be known (such as a block obscured by another object and therefore not known to the robot).

  17. I'll float a few final project ideas. One would be to implement and explore the effectiveness of a knowledge representation engine. To be successful, it would probably have to rely on a number of different representations: images (and perhaps 3D models), perhaps some notion of (fuzzy logic?) predicates, and perhaps probabilistic models such that entities are probabilistically linked with properties, associated concepts, and so forth. This might be effective restricted to a very specific subset domain, but it could be made much more powerful if it could learn new concepts or even ask clarifying questions of a human interviewer. The problem with a system like this is that all of its parts are 1. immensely complicated, 2. seemingly very important, and 3. necessary, such that I imagine it wouldn't work nearly as effectively without any single component. It seems like a large undertaking, but might be made much more effective by restricting it to a small subset, such as recipes to make cookies. On the other hand, I have a strong intuitive feeling that a large part of successful natural language processing is successful knowledge representation.
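
    A very rough sketch of the probabilistic entity-property piece of that engine (the entities, properties, and numbers are all invented):

    # Entities probabilistically linked to properties and associated concepts.
    knowledge = {
        "lemon":  {"properties": {"yellow": 0.95, "sour": 0.9, "round": 0.6},
                   "associated": {"fruit": 0.9, "citrus": 0.85}},
        "cookie": {"properties": {"sweet": 0.95, "round": 0.8},
                   "associated": {"dessert": 0.9, "recipe": 0.5}},
    }

    def plausibility(entity, property_name):
        # How strongly the engine believes the entity has the property.
        return knowledge.get(entity, {}).get("properties", {}).get(property_name, 0.0)

    print(plausibility("lemon", "yellow"))   # 0.95
    print(plausibility("cookie", "yellow"))  # 0.0 (no evidence stored)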

    Another project possibility would be a replication or extension of that 'freshman tour' robot that we saw. I'm also interested in autonomous locomotion, so I found such an overlap between NLP and locomotion to be really interesting. The advantage of this project is that the natural language component seems more simple than the former project. On the other hand, there's all the locomotion stuff to figure out.

    A third possibility that comes to mind is some type of computer voice control over say, web browsing. It would be an amazing interactivity model to be able to talk to web pages or your browser and say commands like, 'Put this link on my Facebook wall'. I know that Google Chrome just recently included some kind of Voice Interaction API in the latest versions of Chrome, and exploring the new interaction models that it affords would be really cool.
