Friday, September 13, 2013

Tuesday, September 17: Grounded Semantics

This week we will read a paper from cognitive science describing the connection between spatial language and spatial cognition. Specifically, they study what geometric and perceptual features people appear to use to map between spatial language and objects in the external world.
For the assignment, let's try a different format to encourage more discussion.
  • By Sunday night at 5pm, please post a comment on the blog of about 200 words about any of the things we have discussed so far.  You might describe a possible project, ask a question and present some possible answers, or compare and contrast ideas in what we've read so far and suggest areas you'd like to investigate more closely.
  • By Monday night at 5pm, please reply to at least one other comment.  Give them feedback about their ideas, try to answer their question, or expand on a point that you agree with.  I'm looking for about 200 words total; it could be spread across several different comments.

47 comments:

  1. So far, what we have discussed are theories about how to represent the semantics of words, phrases, and sentences. I am actually more interested in how to apply them in the robotic world and how to implement them for a robot to use. For example, the Heim and Kratzer approach represents the meaning (or denotation) of words, phrases, and sentences with functions. The basic logic here is clear after we discussed it in class. But if we want to use this approach to implement a robot that can actually understand the world in some respects, and is able to communicate with people in an interactive way, can you design a detailed implementation of the Heim and Kratzer approach? Also, we will need a predefined dictionary or discourse. For this part, can you come up with any ideas about how we can generate such a dictionary at the very beginning? Or is there any learning method that can make our robot more intelligent and able to learn new knowledge?
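
    To make this concrete, here is a minimal Python sketch of the Heim and Kratzer idea of denotations as functions combined by functional application. The tiny "world", the lexicon entries, and the rule names are invented for illustration; this is a sketch, not a proposed implementation:

    # A toy Heim & Kratzer-style fragment: word denotations are functions,
    # and phrases are built up by combining those functions.

    # The "world": entities with perceptual properties (a stand-in for perception).
    entities = {
        "block1":   {"red", "block"},
        "block2":   {"blue", "block"},
        "pyramid1": {"red", "pyramid"},
    }

    # Denotations of nouns/adjectives: functions from entities to True/False.
    lexicon = {
        "block":   lambda x: "block" in entities[x],
        "pyramid": lambda x: "pyramid" in entities[x],
        "red":     lambda x: "red" in entities[x],
        "blue":    lambda x: "blue" in entities[x],
    }

    def modify(adj, noun):
        # Predicate Modification: intersect an adjective with a noun.
        return lambda x: adj(x) and noun(x)

    def the(pred):
        # "the": return the unique entity satisfying the predicate.
        matches = [x for x in entities if pred(x)]
        if len(matches) != 1:
            raise ValueError("presupposition failure: 'the' needs exactly one match")
        return matches[0]

    # "the red block" denotes block1 in this world.
    print(the(modify(lexicon["red"], lexicon["block"])))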

    Replies
    1. For real-world objects and real-world applications for robots, the dictionary needs to be directly connected to the robot's perception system. Therefore, matching images and their 3D shapes (or structures) with object names can be done beforehand, before running the application. This part is called 'semantic mapping'. The rest of the information about the world can also be created through dialogues between robots and human users. An object's relative location and its affordances can be taught during conversation. Nowadays, researchers use crowd-sourcing to gather data from anonymous human users. Amazon Mechanical Turk or CrowdFlower can help us build the semantic map of a given world and build semantic relationships using human dialogues.

  2. Both Winograd's and Heim and Kratzer's approaches to language representation faced a major problem in scalability. They both allowed for accurate definitions of a small subset of the world, and were even capable of defining new things in terms of old definitions, but failed when encountering something not definable in terms of the current world. I was curious as to what work, if any, has been done to tackle this issue of scalability, and what kinds of solutions are possible.
    In a previous class, Professor Tellex mentioned a possibility of using Google Image Search to allow for the identification of objects, which is definitely an idea I would be interested in exploring more thoroughly. This type of recognition seems promising, both allowing for the recognition of new objects and more accurate command executions, but raises some new issues such as time complexity of building a classifier on the fly, and, as she mentioned in class, classifiers for uncommon or unique objects.
    I also thought about what role learning/teaching could play in this, where you had a robot that would encounter a new object and ask a human (or something that could play the role of teacher) what it was, what it was used for, etc., like a small child would. While this seems like it could be promising and effective, there are major memory usage problems involved with this strategy.
    I’d love to delve deeper into how we are/might tackle this problem of scalability in robotic recognition.

    Replies
    1. If we want robots to ask questions about objects and gather enough information to be used in object recognition, we first need to know how we ourselves understand objects. David Marr's approach in the 1980s was that we use 3D shape structures in our mental model to understand different kinds of objects, and that we can build many different models of objects from variations of these 3D shape structures. Computer vision scientists took a different approach and used 2D features to understand objects, relying on catalogues of appearances of objects from different viewpoints. This approach is similar to what Google uses in its image search. We seem to use both to understand what an object is.

    2. Hi Miles --

      This paper of Percy Liang's describes a statistical approach to learning word meanings to make a system that can answer questions about a geographic database:
      http://cs.stanford.edu/~pliang/papers/dcs-acl2011.pdf

      His more recent work generalizes this to FreeBase, a structured database about a wide variety of topics.

    3. While the internet is clearly an enormous source of image data, I'm not sure that the results of an image search would necessarily be useful for a classifier, for reasons fundamental to what image search and training data should be. Ideally, training data would be drawn from the same distribution as the inputs. Image search results, I think, "should" show a platonic version of the query. If you try a search for "apple", this is largely what you get: views of ideal apples from the side. (Though this also demonstrates a lesser issue: translation between visual and linguistic representation can be lossy: apple (fruit) / apple (company).) For a robust visual model you would want inputs of the sort a robot would get in the wild: different lighting conditions, all different angles, obstructions, and so on. These, in turn, would I think make bad search results.

  3. (1) The methods of language representation we have visited so far are purely (non-empirical) symbolic systems – SHRDLU, and a machine of the Frege/Heim/Kratzer breed, relying on the functional application schema, do not incorporate (novel) empirical knowledge to inform their models of understanding. Given that the focus of this course is on Language as applied to the domain of *Robotics*, I'm curious about the ways in which one might leverage the systems for language understanding we have discussed so far (or design new ones) whose functionality includes processing empirical data (e.g. a system that includes sensors & classifiers, as well as a meta-language for incorporating novel information from the sensors into the agent's language model).

    (2) My second question is targeted at the overarching goal of language and robotics. It seems apparent with SHRDLU that, with some work, we ought to be able to design and build a robot capable of simulating language understanding to the extent required for a particular domain of applications (i.e. manufacturing, a kitchen aid, a moon rover, etc.). These 'domain-specific' systems possess language capabilities that fall short when we breach the bounds of the domain the robot is expected to occupy (and in rare other cases, too). Oftentimes this breach is unsatisfactory to the user(s) of the bot, and can severely limit the fluidity of any interaction with the robot. In dire cases, the robot might be completely incapable of accomplishing its task. Ultimately, though, I'm wondering if the lack of flexibility and scalability in these domain-specific models is an obstacle at all – is our ultimate goal to achieve a system that is capable of genuine, human-level language understanding (i.e. one that *never* experiences a 'breach')? Or would we be satisfied with improving our domain-specific models? Do we care if a robot is 'simulating' language understanding, so long as it is capable of accomplishing the tasks necessary for its domain?

  4. So far, it seems that the ideas we have read all have something to do with a human-specified set of grammar rules which can later be used to process human language input (e.g., Terry Winograd writes a collection of programs in SHRDLU's parsing system; I. Heim and A. Kratzer propose a set of rules to project sentences to truth values). However, according to language acquisition (especially first language acquisition) theories, human children are able to develop a certain extent of grammar from a finite number of sentences that they encounter. Since programs are capable of processing a much larger amount of input data than human children, I am interested in what the result would be if we let a robot try to learn grammar rules directly from natural language input, e.g., the correctness and soundness of the learned grammar, whether the learned grammar would be consistent, whether the learned grammar would be more robust than the human-proposed rules, etc.

    Replies
    1. I agree with and would like to expand on the last point you made. Inherently, language and its lexicon are not a closed set, so it's impossible to create a complete dictionary of all the words in the English language. Even creating a sufficiently large set of word meanings would take an extraordinary number of man-hours. However, as you said, using training data and principles from natural language processing could generate satisfactory results. NLP already has tools to determine grammatical correctness and model topics. Although the initial training would take a lot of time, the robot would only need to go through the corpus once, and therefore in the end the training time is realistic.

    2. It would be very exciting if robots could learn grammar directly from natural language and behave like real humans. I am very curious about the process by which robots can learn from training and everyday work. If we get rid of unimportant data and keep only core words, can robots learn better? Human infants don't need to know "buy things" as a pair; they can just say "buy". How can we simplify the process?

  5. Up to this point we have seen a couple of different approaches that have attempted to tackle the symbol grounding problem. While these approaches may differ in their application, the theory behind them is largely similar. The work of both Winograd and Frege/Heim/Kratzer took advantage of the compositional nature of sentences to apply a logical framework to them. In particular, the focus seems to be on using lambda calculus to traverse a parse tree and generate a composite meaning for a sentence that is based in grounded symbols and predicates. From what we have seen this seems like a relatively fragile system that is unable to handle symbols that fall outside of its domain. I believe that the introduction of statistical techniques into this field has been an attempt to remedy this fragility.

    The introduction of probability and statistics addresses two core issues: 1) learning from "experience" and 2) modeling uncertainty. The first goal is relatively straightforward: if we want a machine that can learn from its environment, it needs to be able to recognize patterns and be trained on previous examples. The second goal might be less obvious, but is much more far-reaching. Unless we are dealing with a simulation, uncertainty will permeate whichever environment the robot finds itself in (e.g., disambiguation of word meaning, event likelihood, or interpreting human actions). In order for our robot to make educated decisions in such an environment, it needs to be able to model this uncertainty to some degree.
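
    As a toy illustration of the second point, here is a small Python sketch of grounding under uncertainty; the candidate referents, probabilities, and threshold are all invented:

    # Instead of a single grounded symbol, each candidate referent for
    # "the red block" gets a probability, and the robot either acts on the
    # most likely one or asks a clarifying question.
    candidates = {"block1": 0.70, "block3": 0.25, "pyramid1": 0.05}

    best, p_best = max(candidates.items(), key=lambda kv: kv[1])

    CONFIDENCE_THRESHOLD = 0.6  # invented threshold for acting vs. asking
    if p_best >= CONFIDENCE_THRESHOLD:
        print(f"pick_up({best})")
    else:
        print("Which one do you mean?")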

    Replies
    1. You propose that including stats and learning would greatly help the model, and I think unquestionably it would. However, I think it's not nearly that simple. First off, it's important to frame the problem correctly. If we are strictly talking about the SHRDLU program, I think re-wiring it to use statistics would accomplish a fair bit (fixing the case where we asked it, "What are you holding [now]?", where it broke on the inclusion of "now"), but there are other things it wouldn't help to solve (not learning, from the pyramid placement attempt, that pyramids cannot support other pyramids). However, I feel pretty strongly that the SHRDLU program 'cheats' in a very important way, in that it is given omniscient, pre-determined information about the world. A counter-argument is that this reduces the problem specifically to natural language processing without having to worry about learning, knowledge representation, and perhaps other issues. On the other hand, these issues seem so critical to the language processing problem (how do you understand the most likely meaning of a sentence if you don't know what is plausible in the world?) that it seems misguided to push them aside.

  6. In the previous topics we were focusing on how a robot understands and represents word/sentence meanings. Although it would be ideal to design a perfect algorithm so that the robot never makes mistakes, it is inevitable that some mistakes happen during a robot's learning process. But it seems neither of these papers mentions how to correct mistakes, and I am wondering if there are ways to do so and to update the robot's information in time. For example, if a robot picks up a red box though the user is asking it to pick up a blue one, with a good mistake-correction algorithm the user can tell the robot it should pick up the blue one and not the red one, and it will not make the mistake in the future because it updates its knowledge. (This seems similar to the way SHRDLU learns a new word. Obviously the robot will keep making mistakes if we do not correct them.)

    Replies
    1. Comments about the reading paper:
      The paper indicates that the spatial representations linked to object names must provide enough different shape descriptions to distinguish all kinds of objects categorized linguistically on the basis of shape. Sometimes the shape of an object is critical to what the object is called. I agree that the importance of shape in object naming is particularly dramatic. For example, 'toy bear' may not really belong to the category 'bear'. In this case, I think it's possible to treat 'toy' as an adjective, the way 'big' is treated in the paper's example. Or we can rephrase it as 'bear toy', which is clearly categorized as a toy.

      And actually, I supposed the perceptual ground is the same as the reference object mentioned in the paper, and even the same in the linguistic representation of place. But the paper indicates that they do not completely overlap, giving an example: in 'The cat is on the mat', the mat is both the perceptual ground and the reference object, but in 'The cat is near the mat', the mat is the reference object but not the perceptual ground. I don't understand how 'on' and 'near' make the difference.

  7. The papers we have looked at so far model language as procedures or functions that are evaluated by an agent to produce some change in the world. This approach reminds me of program interpreters and how we model programming languages. It might be interesting to consider how we might make this approach more modern by treating English language commands as programs and using the approaches we have for interpreting those.

    Perhaps a robot could have a “core” programming language which is made up of those simple, essential commands such as “move(x,y)” or “moveArmToPosition(x,y,z).” It could then have a higher level language with features like “pick up” and have an environment containing those objects about which it is aware. Commands in this higher level language would be desugared into more basic commands. The “surface syntax” would be English commands. These commands would be probabilistically parsed into the high level language.

    This approach could allow us to better separate out the different kinds of learning and understanding that are going on. There's the understanding of what the action words in a command mean which is represented by the mapping from the surface English to the high level language. The problem of mapping symbols to the world is encompassed in the design of the environment. Finally the question of knowing how to perform some action is represented in the desugaring from the high level language to the low level language.

    This thought, of course, doesn't necessarily provide a way to surpass the limitations of these papers but perhaps gives us a more up-to-date and useful model for thinking about them.
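
    As a rough sketch of that layering (the command names, the tiny environment, and the trivial "parser" are all invented; a real system would put a probabilistic parser where the string matching is):

    # Three layers: surface English -> high-level language -> core commands.

    # Core language: the robot's primitive actions.
    def move_arm_to(x, y, z):
        print(f"moveArmToPosition({x},{y},{z})")

    def close_gripper():
        print("closeGripper()")

    def open_gripper():
        print("openGripper()")

    # Environment: objects the robot knows about, with their poses.
    environment = {"red block": (0.2, 0.1, 0.0), "green block": (0.5, 0.3, 0.0)}

    # High-level language, desugared into core commands.
    def pick_up(obj):
        x, y, z = environment[obj]
        move_arm_to(x, y, z)
        close_gripper()

    def put_on(obj, target):
        pick_up(obj)
        x, y, z = environment[target]
        move_arm_to(x, y, z + 0.05)   # place slightly above the target
        open_gripper()

    # "Surface syntax": a trivial stand-in for the probabilistic parse step.
    def interpret(utterance):
        if utterance.startswith("put the ") and " on the " in utterance:
            obj, target = utterance[len("put the "):].split(" on the ")
            put_on(obj, target)

    interpret("put the red block on the green block")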

    Replies
    1. Interpreting natural languages is so much trickier than interpreting programming languages. It is possible to interpret programming languages because they follow a few strict rules. To interpret a programming language command, you just need to follow the rules. On the other hand, natural languages don’t really obey certain rules. There are many exceptions to the rules, and many words, phrases and sentences are ambiguous. I think that treating natural languages as programming languages would not help robots to “understand” natural language commands.

  8. From the previous classes in this course, we have seen that it is possible for humans and machines to have a conversation, share knowledge about a given task, and conduct that task when the context and the number of words they speak are limited. We also saw real- and virtual-world examples such as SHRDLU, IKEA assembly bots, and direction-following quadrotors.
    Given this, for the class project I want to investigate re-implementing some of these examples and bringing them to remote situations where a human user is meant to tele-operate the robot.
    In the teleoperation scenario, I would like to pursue increasing the robot's autonomy as much as possible to ease the cognitive load of the operator. Instead of operating the robot for each given action, I aim for users to give instructions in natural language. I am also interested in building semantic world models to help the robot understand its surroundings and better understand human instructions.
    In re-implementing the example, I would like to investigate what components are missing and what the possible ways are to make given tasks more scalable so that they can be adapted to real-world situations with ease.
    On the other hand, I would also like to investigate what other scenarios are possible given that we have to limit the total number of words spoken and the context has to be set.

    Replies
    1. So in the process of a robot completing a task, there is a high probability that the robot will fail to execute some action. With the help of a human teleoperating the robot when it fails to execute an action, the robot can complete a given task better. In addition, the robot can resume autonomously executing its remaining actions after the human helps it complete the failed action. For example, in a drink-serving scenario, the robot aims to serve a bottle of coke from the dining_table to the human. When it fails in navigating to the dining_table, the user teleoperates it to the dining_table, and the robot then segments the table, recognizes the coke, and grasps it. In fact, Professor Manuela Veloso from CMU proposed an idea called "symbiotic autonomy" for robotics, in which a human and a robot complete a task together. There is a human-centered task planner inside the robot. Maybe this is the cutting edge of human-robot interaction research.

    2. I agree with Zhiqiang -- I think it's interesting to think of robotic systems where failure is anticipated and human correction accounted for in recovery. A good analog might be efforts to improve error messages in programming languages, which make a big difference IMO.

  9. The two approaches previously discussed were both trying to define a general way of representing the meaning of all words and word types with one unified theory. In contrast, Landau and Jackendoff explicitly divide the lexicon into two ideas, 'what' and 'where,' and examine them separately. While it might not be realistic to neatly divide all words into categories such as 'what,' 'where', and 'movement,' it helps to acknowledge that different word types will have different kinds of word meanings. It might help a robot understand if it can look in a smaller set for a word's meaning.

    I mentioned something along these lines in my last post, but like others I'm wondering what the goal is for robotic language. Are we trying to have a robot 'understand' language in a philosophical way like humans do, or trying to have the robot accurately interpret language in order to function appropriately? In my opinion, I don't see trying to replicate human thought processes as a desirable, even if possible, goal. The theories we've studied are trying to model how humans represent meaning cognitively. I think that is a very interesting subject and might provide ideas for systems to use for robots, but not the solution. Instead, I see the process as something more like creating a constructed language, for example Esperanto. People studied the way languages work, what seemed logical and simple and 'correct,' and invented a language based on that. For robots, I see it as a similar process: use the knowledge gained from studying human cognition to design a system of cognition for robots.

    Replies
    1. I think this is a good point; just having a very human-like knowledge representation doesn't make a robot useful and is probably more likely to make the problem more difficult. We are likely to end up representing knowledge in some statistical manner anyway, and from my experience with cog sci courses this seems to be a pretty popular model for how human beings represent linguistic understanding. It is of course still useful to talk about human beings, since they are the most functional language-understanding system we have.

    2. I think people are trying to mimic the human’s representation of word meanings because nobody really knows how to represent word meanings differently. If people knew a different way of effectively representing word meanings, we would already have some remarkable AI systems. I think it doesn't really matter how robots represent word meanings as long as they work. I don’t know if constructing a new way to represent word meanings is easier than simulating what we do. I would take an easier approach, if I had a clue.

    3. I agree insofar as replicating human representations for their own sake is silly (from a practical robotics perspective, maybe not from a cogsci perspective). But I think that there's a distinction to be made between things which are at least as capable as human cognition and things which are exactly as capable. We have neither at the moment, so I think human cognition is an appropriate implementation to study. As you point out, this probably should serve as inspiration rather than specification.

    4. I agree that the goal for robotic language should be building robots that "think rationally" rather than "think humanly", and I think the idea of creating a constructed language is a very good point. However, if we would like humans to be able to communicate effortlessly with robots, eventually the communication would have to be made in natural language (in most cases, English). Therefore, after constructing an internal representation system that is suitable for robots, it might still be worth considering how to project it onto human language.

  10. In the course we have looked at two methods of representing natural language: SHRDLU, which combines syntactic analysis with semantic analysis (with some heuristics), and the Heim and Kratzer approach, which looks to convert natural language to truth values. Both approaches try to break a large NLP problem into subsets/scenarios that can be solved reasonably, and hence face scaling issues (when targeting a real-world problem). The SHRDLU approach is far-reaching from the point of view of robotics because, as a platform, the simulated system had the ability to evaluate and manipulate its environment, which is required in a real-world robot. I would be interested in a machine learning perspective: can we generate a value function over which we force the robot to learn language? Not the sort of Chinese learning problem, but one where active feedback is given to the robot to learn, along with heuristics. What level of heuristics and environment would be needed would still have to be decided.
    @Yujie Wan: If we throw a big wall of text at a program, how does it know what it has to replicate? Grammar rules in most languages have easy heuristics or a predictable model. If we throw that out, we would be trying to solve a larger problem, and the feature space can be huge. At least with heuristics we can classify a word as a feature type (verb or noun or adjective). Plus there can be context-related semantic problems. Maybe I am wrong here, but heuristics seem to ease the problem. I would also like a pure machine-learning-based formulation for language understanding, but I think the combinations possible in natural language can be far more than those in object recognition (where we named every object and its activities and their different orders). I would like to know more about this too.

    Replies
    1. From what I've understood, it seems like you're interested in using reinforcement learning to teach the robot semantic analysis. This is an interesting idea, and I believe it fits well with a model of how humans learn language. If a young child makes a grammatical mistake, they are corrected (or even punished) for doing so. Reinforcement learning would allow you to simulate this process, the robot would attempt to parse some text and would be punished/rewarded for its efforts. If this is what you are interested in pursuing it may be worthwhile to speak to Michael Littman on the matter.

      On the other hand, this doesn't exactly explain how young children learn semantic analysis. I would argue that at an extremely young age children might not be able to understand the connection between the reward/punishment and their understanding of language.
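
      As a toy sketch of that reward-driven word learning (the words, actions, and teacher signal are invented, and this is only a bandit-style update rather than a full RL formulation):

      import random

      # Learn which action each word refers to purely from reward:
      # +1 when the chosen action matches the teacher's intent, -1 otherwise.
      words = ["grab", "drop"]
      actions = ["close_gripper", "open_gripper"]
      correct = {"grab": "close_gripper", "drop": "open_gripper"}  # hidden from the learner

      # Q[word][action]: estimated value of taking `action` on hearing `word`.
      Q = {w: {a: 0.0 for a in actions} for w in words}
      alpha, epsilon = 0.5, 0.2

      for step in range(200):
          w = random.choice(words)
          # epsilon-greedy: mostly exploit the current estimate, sometimes explore.
          if random.random() < epsilon:
              a = random.choice(actions)
          else:
              a = max(Q[w], key=Q[w].get)
          reward = 1.0 if a == correct[w] else -1.0   # "teacher" feedback
          Q[w][a] += alpha * (reward - Q[w][a])        # simple value update

      print({w: max(Q[w], key=Q[w].get) for w in words})  # learned word -> action map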

  11. In Thursday's lecture, we discussed semantics and language meaning. The Fregean program helps us form structures to identify the truth-conditions of sentence components. In order to do that, the Fregean program puts forward ways to represent sentence components using functions. It divides the sentence into different parts and tries to form uniform functions for various words if they are of the same kind. Truth-conditional values that we get from context can be applied within the Fregean program. When the final outcome is not a truth-condition, the sentence is uninterpretable. In a computational linguistics class, we learned how to analyze sentences into components using dictionaries that have been created beforehand. However, it can take an hour or so to analyze a file with several thousand words. This might be too slow for building a real robot. How is the Fregean program used in designing robots? Are there any efficient algorithms? Since the Fregean program finds the truth-conditions based on context, we need to build context for robots. How do we do that? And what if the robot comes across some words that are not in the dictionary?

    Replies
    1. It does take quite a long time to create a dictionary from training data. However, if we consider the situations where a robot is to be used, we may find that we are actually not likely to give the robot very long commands. For a service robot, we may only need short commands like "pick up the book", "wash the dishes", or some other verb phrase. Maybe that is not the only case, but hopefully a robot can process such commands quite quickly.
      However, when considering processing time, Siri is one example where a single command may take some time to finish. Although it will not take long to "understand" the question, I think most of the time is spent searching for the "correct" answer. Also, in the video we saw in the first class, although the robot could "understand" the short command quickly, I think it spent a long time on object recognition and computing the desired actions.

    2. We can use data that we already know. And what should we do if we use robots with less structured languages like Chinese? What is more, Chinese contains a lot of words that have the same or similar pronunciations. For instance, "wine" and "nine" have the same pronunciation. How can we tell whether the user wants "nine bottles" or "wine bottles"?

    3. I asked something similar in my post, where I question the connection between interpreted "meanings" and the real world. I'm under the impression that robots must have additional pre-programmed modules for such interactions. Then sentences which are interpreted into truth conditions can be combined by the relevant modules to mean something useful. In other words, something would need to be layered on top of the Fregean program. For example, I'd guess that if you ask a robot what color a block is, a side module could cycle through colors, asking itself whether the block is that color. Then it may choose an answer depending on additional statements.

      This potentially means that in order to build a truly versatile robot, a lot of components need to be brought together for anything that the robot may do. I wonder whether we may one day create a true artificial intelligence which learns things completely outside of its programmed scope. So far, I've heard nothing of the sort.
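
      A tiny Python sketch of that "side module" idea, with an invented world model and color predicates layered on top of truth-conditional questions:

      # To answer "What color is block1?", cycle through color predicates and
      # keep the one whose truth condition holds in the world model.
      world = {"block1": {"color": "red"}, "block2": {"color": "blue"}}

      color_predicates = {
          c: (lambda x, c=c: world[x]["color"] == c)
          for c in ["red", "blue", "green"]
      }

      def what_color(x):
          for color, holds in color_predicates.items():
              if holds(x):        # ask the yes/no question "x is <color>"
                  return color
          return "unknown"

      print(what_color("block1"))  # -> "red"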

  12. I have a question: what are we supposed to learn from the previous readings and discussions? I don’t see how they can help us develop intelligent systems that could “understand” natural languages.

    SHRDLU was a nice attempt in the past but I don’t see how SHRDLU can be extended in a meaningful way that helps construct smarter robots. It can only execute commands that it has already memorized. This is the very opposite of the statistical approaches that we are interested in.

    Mapping English sentences to lambda calculus is nice. But what do we do with it? It doesn’t really provide us much information. And I don’t see how lambda representation of English sentences could empower robots to “understand” English sentences.

    Are we supposed to learn what not to do when we are building intelligent agents? Or are we supposed to feel how hard it is to build such agents? Or what am I missing?

    Replies
    1. I think the robot can use a dictionary to understand natural language, and we can use machine learning algorithms to help it. We can use mappings to help the robot find what it is instructed to do and the corresponding responses. It is like the reverse of the process of developing games: developing games and other applications requires us to analyse data and represent users' instructions as vivid pictures or movements. Instructing robots means getting instructions from users, analysing them, and comparing them with known data sets.

  13. Hector Levesque gave a speech at IJCAI, later written up as "On our best behavior" [1]. In the paper he argues that a lot of AI is built on cheap tricks (aka heuristics). In short, things that get right answers for the wrong reasons. He gives an example prompt:

    "Could a crocodile run a steeplechase?"
    [ ] yes
    [ ] no

    This can be answered in several ways. The human approach would be something approximating a crocodile simulation. I know what a crocodile is capable of doing, approximately, and it's quite easy for me to determine that a crocodile couldn't clear hurdles.

    A cheap trick relies on a closed world assumption: if I can't find an example of a crocodile running a steeplechase, they can't. Note that this is something we saw in SHRDLU too: hypothetical questions about the world were answered only when there was an example of the hypothetical in the block world.

    Levesque proposes a better type of intelligence test, which he names a Winograd-schema test. There are many examples in the paper, which I won't include here. The essence is to exercise some expertise in disambiguating a question -- expertise in things like human behavior, material properties, appearance and so on.

    The tricky thing is that in many domains -- vision and language come especially to mind -- you need to know things to learn things. This makes things very difficult to get going, as there's a circular dependency between understanding and itself.

    I'd like to look at bootstrapping linguistic systems, where the primary objective is to provide some minimal (not necessarily small) understanding and use that to expand understanding via whatever means might be effective. Levesque does warn: "I do not think that we will ever be able to build a small computer program, give it a camera and a microphone or put on the web, and expect it to acquire what it needs all by itself." I agree. But I'd like to explore what it might take to do so theoretically, and try to make exploratory in-roads on understanding primarily with learned information.

    [1] http://www.cs.toronto.edu/~hector/Papers/ijcai-13-paper.pdf

    Replies
    1. I'd like to highlight something further. Many of you seem to express the opinion that true language understanding isn't worthwhile -- domain-specific or functional/behavioral simulation is enough. I disagree. Granted, domain-specific language is useful, is relatively practical, and has yielded impressive results. But I think that true language understanding is not only a much more interesting problem, but a more promising one to solve. Useful behavior in individual domains, while hardly a solved problem generally, has been demonstrated. It's the ability to work in any domain that makes human intelligence uniquely valuable. Finding a way to achieve similar flexibility computationally would not only solve all instances of single-domain problems, but would additionally solve the problem of choosing, switching, and learning appropriate domains. The implications of that are much more than the sum of every useful single-domain application.

    2. I'll play devil's advocate here. I acknowledge that domain flexibility is a critical and integral feature of human intelligence – I'm just not convinced that robots need this feature in order to fulfill their potential. To me, robotics has the potential to contribute massive amounts of utility to society due to the fact that robots are physically (and computationally) capable of feats that humans are not. They can exist in spaces we cannot (e.g. outer space, the bottom of the ocean), and can execute tasks that are simply not possible for humans. While robots like the Roomba are neat, I'm convinced that the ultimate end-game for robotics does not lie in mechanizing household tasks. So, in thinking about what this end-game might consist of (I'm curious what you think it might be), my question is, do we require genuine language understanding? If, as you suggest, achieving human-levels of language understanding will essentially solve all of the problems of each sub-domain (as opposed to employing more sophisticated 'cheap tricks' to try and accomplish domain-specific tasks), then it seems a no-brainer – let's solve all the problems with a single solution. However, I'm not convinced such a solution is as tractable as expanding the 'cheap trick' methodology, and it appears as though we might be capable of reaching the level of utility we want from robots that simply simulate language understanding with respect to their domain. I'd also like to add that I think there will certainly be leaps and bounds made within various realms of robotics (e.g. Learning/Vision/Knowledge Representation) that will contribute across all domains of robotics, I'm just not convinced that our goal ought to be achieving genuine language understanding, given the inherent intractability of the problem, and considering that we might not *need* those levels of language understanding to achieve what it is that robots ought to achieve (which is obviously up for debate). Let's discuss more about this tomorrow in class! This is a really interesting area to chat about..

    3. Also - I really like the sound of Levesque's work. I looked into the Winograd-Schema test and it's quite neat. Do you think it is a sufficient test for determining genuine language understanding? Levesque provides a few examples in his paper "The Winograd Schema Challenge" (http://commonsensereasoning.org/2011/papers/Levesque.pdf)

      (1) The trophy would not fit in the brown suitcase because it
      was too big. What was too big?
      Answer 0: the trophy
      Answer 1: the suitcase

      (2) Joan made sure to thank Susan for all the help she had
      given. Who had given the help?
      Answer 0: Joan
      Answer 1: Susan

      One of his goals for the test was to design it in such a way that "having full access to a large corpus of English text might not help much", which is an interesting route to take. If this test isn't sufficient, could we offer a better candidate?

    4. No doubt everyone would like a robot to have true understanding of its environment and language, but the methods we have to deal with these are goal-oriented right now. Consider computer vision: we have only now begun to look at the problem of inference in vision (like activity recognition). And we understand that there can be a huge corpus of activities that, like grammar, can be composed. But we look to recognize a small set of activities, not all of them together. That is how we start. Humans themselves can solve Winograd's test only because of heuristics. We are not evaluating the sentence based on preset rules we know (like a regular councilman would not be jealous).

  14. How can a robot understand users' commands and tasks and complete them in everyday life? For example, "put the red block on the green block". In SHRDLU's approach, this command is of the (#PUT X Y) type and it should be grounded into specific symbols. Assume that there are already some assertions in SHRDLU: (#BLOCK B1). (#COLOR B1 RED). (#BLOCK B2). (#COLOR B2 GREEN). So this command is grounded into (#PUT B1 B2) and is executed by its executor. In Heim and Kratzer's approach, the parse tree is (ROOT
    (S
    (VP (VB put)
    (NP (DT the) (JJ red) (NN block))
    (PP (IN on)
    (NP (DT the) (JJ green) (NN block))))))



    and there should be corresponding lexicon entries for the words "the", "red", "on", and semantic rules to get the meaning of the command. These lexicon entries and semantic rules are all manually written. So as there are more and more sentence patterns and words, it will take a lot of labor to write down the corresponding rules.
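
    As a toy sketch of the SHRDLU-style grounding step above (the assertion database and the helper function are invented, not Winograd's actual code):

    # Ground (#PUT X Y) against a small set of assertions.
    assertions = {("BLOCK", "B1"), ("COLOR", "B1", "RED"),
                  ("BLOCK", "B2"), ("COLOR", "B2", "GREEN")}

    def find(shape, color):
        # Return the symbol satisfying both the shape and the color assertion.
        for sym in {a[1] for a in assertions if a[0] == shape}:
            if ("COLOR", sym, color) in assertions:
                return sym
        return None

    x = find("BLOCK", "RED")     # -> "B1"
    y = find("BLOCK", "GREEN")   # -> "B2"
    print(("#PUT", x, y))        # grounded command: ('#PUT', 'B1', 'B2')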


    If the system is combined with a reasoning part that can make deductions about the current environment, as in SHRDLU, much more handcrafted knowledge will be needed. For the command mentioned above, if the robot already has a blue block in its hand and there is a pyramid on the green block, the system must make inferences about what the robot should do to complete the command. The robot should first put down the blue block in its hand, then pick up the pyramid from the green block and put it on the table. At last, the robot can put the red block on the green block. So how can the system make deductions like this? To achieve this, the system should have knowledge about its action model: it should know the preconditions and effects of each action. The robot couldn't pick up an object with some object already in its hand, so the precondition of the "pick up" action is that its hand is empty, and the effect of the action is that its hand holds the object it picked up. With more and more actions incorporated into the robot, it is inevitable that this knowledge be written manually into the system.
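
    A minimal sketch of such an action model in Python; the state variables are invented, and the ordering of the steps is written by hand here rather than found by a planner:

    # Each action checks its preconditions and applies its effects.
    state = {"holding": "blue_block", "on_green_block": "pyramid"}

    def put_down_held(s):
        # precondition: holding something; effect: the hand is empty
        assert s["holding"] is not None
        s["holding"] = None

    def clear_green_block(s):
        # precondition: hand empty and a pyramid on the green block;
        # effect: the pyramid is moved to the table
        assert s["holding"] is None and s["on_green_block"] == "pyramid"
        s["on_green_block"] = None

    def put_red_on_green(s):
        # preconditions: hand empty and the green block is clear
        # (picking up the red block is folded into this step for brevity)
        assert s["holding"] is None and s["on_green_block"] is None
        s["on_green_block"] = "red_block"

    # The deduced order: free the hand, clear the target, then stack.
    for step in (put_down_held, clear_green_block, put_red_on_green):
        step(state)
    print(state)   # {'holding': None, 'on_green_block': 'red_block'}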


    Furthermore, let's consider a more complex situation. In a household environment, the users may ask a service robot to complete various everyday activities such as setting a table, preparing meals, or serving drinks. But such tasks are extremely knowledge-intensive, so the designer of the robot cannot program them all beforehand. For example, when the human tells the robot to prepare a meal, how does the robot know how to prepare a meal? This knowledge must either be handcrafted or obtained from somewhere else.


    In general, there are still many problems to solve before we can implement a real robot that can serve people in a household environment. But we can approach this goal by re-implementing a simple AI application such as the blocks world using state-of-the-art techniques and methods.

    Replies
    1. Shouldn't we be able to develop robots that acquire this kind of knowledge through reinforcement learning or classical planning? Reinforcement learning, of course, might take a very long time to learn a complex task and still often requires a good amount of human intervention to provide the actions and rewards the robot takes and receives. Alternatively, in a small domain we could avoid having to wait for an RL process to take place by manually entering the relationships between various actions and having the robot run a planner. It seems like we already have these useful tools to help advance beyond the blocks world.

    2. The main focus of SHRDLU was to enable a robot to communicate with a human and carry out the things that the person told it to do. This approach is a little different from RL or other planning approaches, where the goal is already pre-programmed and the robot's job is only to find what actions to take to achieve the given goal. Instead of trying many different actions, the robot may ask questions about the ambiguity it is facing or ask for help. Therefore, making a robot do house chores has to do with enabling robots to understand their surroundings, learn specific motor skills, understand people's commands or questions, and reason about these commands and questions to choose what actions to take or what questions to ask back.

  15. The real problem with the Heim and Kratzer reading is that it's all about context-free grammar and the title of the class is about grounded language. Grounding implies context awareness, not context-freedom. If grounding was as simple as context-free language plus context, wouldn't there be some better examples of it available by now? SHRDLU has some context, but ultimately it is still just manipulating symbols that, to the computer, are void of further meaning. This doesn't happen with people.

    Every noun I use conjures up a fairly rich set of associations, some of which have to do with a widely-shared sense of the word's meaning, while others are more private. For lots of nouns there is personal experience behind them, along with the definitions in terms of other words. A lemon isn't just a "yellow citrus fruit" but is a fruit I've picked, tasted, observed, spit out, and laughed over. Abstract nouns like "justice" conjure up meanings in terms of disputes I've had with my sister, as well as stories of what happens in courtrooms, as well as a freshman philosophy class spent dissecting Rawls. This is the grounding behind the language I use. Robots, because they exist in the real world, promise the possibility of a similar kind of grounding, but I doubt we'll get there via the same roads that brought us context-free grammars.

    Replies
    1. If you wanted to set some context for word meanings, the most straightforward approach would be to use a large set of dictionaries and thesauruses to map words to other words or phrases. Then furthermore, you could create statistical relationships based on how closely they are set in large bodies of writing. But this sort of creates an intractable set of problems for common words. Words with too many meanings will be too connected and actually make it impossible for our agents to set some specific definition. For example, the word 'set'.

      I think I set some sort of record here, maybe the sun just needs to set on my comment, or perhaps I just set in motion a terrible, terrible comment thread.
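
      A toy sketch of the statistical-relationship idea above, counting which words co-occur in the same sentence of a tiny invented corpus:

      from collections import Counter
      from itertools import combinations

      corpus = ["the cat sat on the mat",
                "the cat chased the mouse",
                "the mouse ate the cheese"]

      # Count how often each pair of words appears in the same sentence;
      # pairs that co-occur often are treated as related.
      pair_counts = Counter()
      for sentence in corpus:
          words = set(sentence.split())
          for a, b in combinations(sorted(words), 2):
              pair_counts[(a, b)] += 1

      print(pair_counts.most_common(3))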

    2. Please set aside the set of objections that set up dictionaries and thesauri as final arbiters of meaning. A dictionary is a set of definitions that set out a common denominator of meaning for people who already speak the language. A thesaurus is a set of sets of synonyms that set... ok, enough of that.

      The point is that neither dictionaries, thesauri, nor (ahem) sets of statistical correlations grounds a symbol in a, well, set of experiences, feelings, or impressions. If you've ever watched a six-year-old, proud of her ability to read, consult a dictionary to find a word meaning and have no idea what the definition meant, you'll know how inadequate they can be to specify a meaning. Merely specifying a dictionary just creates an infinite regress where words are defined only in terms of other words. There has to be something else behind the symbol.

      For SHRDLU, it was the functions that made up the word meanings. Ultimately this wasn't as satisfying as it should be because the functions themselves operate on LISP symbols, which aren't really grounded either. But it did mean that the English words were grounded in LISP symbols, which seems a step forward out of the infinite regress trap.

      I think a lot of the issue is a real lack of clarity about the goal. We say we want a machine to understand a sentence, to find its meaning. But I think these two words don't have very precise definitions, allowing people to think they've satisfied them with things like lambda functions.

    3. I agree; a machine has no set of goals to achieve aside from what it is being asked to do. Machine learning and reinforcement learning are clear-cut setups: we have a goal, which we try to optimize. To make a robot do a task there have to be overall goals, of which speech is just a part. For example, if I ask a robot to pick up a Lego block, the goal is not that the robot understood me; the goal is that the robot finished its task of picking up the block. For this, everything in the goal setup must be defined, including what a block (the word) is, and where it must be found. It makes no sense to explain to the robot what Marxism is to achieve this task, or to make it memorize the whole dictionary; that only complicates the setup. Humans gain knowledge slowly, not all at once.

  16. I'm floating a couple of ideas for final projects. The first is the question of how to provide additional unprompted context to robot commands or questions. Just by observing that an agent is behaving in an undesired manner, users should be able to correct specific actions of the agent. Depending on which chunk of the command the agent is completing, a correction or update would carry a different meaning. A sentence describing picking up a block and putting it on another block requires two actions. If during the first, the user wanted the robot to pick up a different block, he should be able to say, "No, the (other/red/closer) block", and expect the agent to try to remap the meaning. Whereas, if the robot was placing the block, that same sentence would carry a different meaning and thus require different actions.

    My second problem to explore, nascent though it may be, is how to map words to different domains. We said in class that limited language domains are required in order to create tractable understanding algorithms, but what if the system could switch its active domain based on the context of the language? If, for example, you were giving basic robot commands in the block world, but wanted to ask the robot a completely unrelated question, you could give enough context to make that question completely unlikely in the current domain, but statistically very likely in another domain.
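
    A rough sketch of that domain-switching idea, scoring an utterance under invented per-domain unigram vocabularies and picking the most likely domain:

    import math

    domain_vocab = {
        "blocks":  {"block": 0.3, "pick": 0.2, "stack": 0.2, "red": 0.2, "up": 0.1},
        "weather": {"rain": 0.3, "today": 0.25, "cold": 0.25, "outside": 0.2},
    }
    SMOOTHING = 1e-4   # probability assigned to out-of-vocabulary words

    def log_likelihood(utterance, vocab):
        return sum(math.log(vocab.get(w, SMOOTHING)) for w in utterance.split())

    def best_domain(utterance):
        return max(domain_vocab, key=lambda d: log_likelihood(utterance, domain_vocab[d]))

    print(best_domain("pick up the red block"))      # -> "blocks"
    print(best_domain("is it cold outside today"))   # -> "weather"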

    Replies
    1. For your first idea, it seems like SHRDLU already does something very similar to this with its context recognition. The only difference is that since SHRDLU is text-based, it finishes executing or asks for clarification rather than acting continuously as a robot in the real world would. The cases you are describing are similar to the scenario where SHRDLU is told to "Put the red pyramid on the blue block" when there are two blue blocks. SHRDLU would immediately ask for clarification, whereas in the system you describe, it might just pick the closest blue block and then allow for interrupting clarification. This could then be interpreted as if SHRDLU had waited on the action and asked for clarification instead. Additional context could easily be added by the current action (picking up vs. setting down, etc.) in the same way that context was applied to "one" and "it" in SHRDLU. This seems like a very reasonable way to extend SHRDLU's functionality into the real world, and would also be much more helpful in the real world, where the entire world may not be known (such as a block obscured by another object and therefore not known to the robot).

  17. I'll float a few final project ideas. One would be to implement and explore the effectiveness of a knowledge representation engine. To be successful, it would probably have to rely on a number of different representations: images (and perhaps 3D models), perhaps some notion of (fuzzy logic?) predicates, and perhaps probabilistic models such that entities are probabilistically linked with properties, associated concepts, and so forth. This might be effective restricted to a very specific subset domain, but it could be made much more powerful if it could learn new concepts or even ask clarifying questions of a human interviewer. The problem with a system like this is that all of its parts are 1. immensely complicated, 2. seemingly very important, and 3. necessary, such that I imagine it wouldn't work nearly as effectively without any single component. It seems like a large undertaking, but might be made much more effective by restricting it to a small subset, such as recipes to make cookies. On the other hand, I have a strong intuitive feeling that a large part of successful natural language processing is successful knowledge representation.
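
    A very rough sketch of the probabilistic entity-property piece of that engine (the entities, properties, and numbers are all invented):

    # Entities probabilistically linked to properties and associated concepts.
    knowledge = {
        "lemon":  {"properties": {"yellow": 0.95, "sour": 0.9, "round": 0.6},
                   "associated": {"fruit": 0.9, "citrus": 0.85}},
        "cookie": {"properties": {"sweet": 0.95, "round": 0.8},
                   "associated": {"dessert": 0.9, "recipe": 0.5}},
    }

    def plausibility(entity, property_name):
        # How strongly the engine believes the entity has the property.
        return knowledge.get(entity, {}).get("properties", {}).get(property_name, 0.0)

    print(plausibility("lemon", "yellow"))   # 0.95
    print(plausibility("cookie", "yellow"))  # 0.0 (no evidence stored)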

    Another project possibility would be a replication or extension of that 'freshman tour' robot that we saw. I'm also interested in autonomous locomotion, so I found such an overlap between NLP and locomotion to be really interesting. The advantage of this project is that the natural language component seems more simple than the former project. On the other hand, there's all the locomotion stuff to figure out.

    A third possibility that comes to mind is some type of computer voice control over say, web browsing. It would be an amazing interactivity model to be able to talk to web pages or your browser and say commands like, 'Put this link on my Facebook wall'. I know that Google Chrome just recently included some kind of Voice Interaction API in the latest versions of Chrome, and exploring the new interaction models that it affords would be really cool.
