Friday, November 22, 2013

Deep Parsing

 This week we will read a paper on deep models for parsing:
There is a lot going on in this paper.  By Sunday night at 5pm, post a question about something in the paper.  By Monday night at 5pm, post an answer to somebody else's question. 

There is no class on Thursday.  Happy Thanksgiving!

40 comments:

  1. Since the paper doesn’t specifically mention robotics, my question is: how would you apply sentiment analysis to robotics? I think one way would be to include sentiment analysis when interpreting natural language, to know whether a human collaborator is happy or angry and therefore whether the robot should continue what it’s doing or switch actions.

    1. This comment has been removed by the author.

    2. As Stefanie and Kurt mentioned in the last class, one application of natural language for robots is having a robot help us respond to email or messages on social networks. Using sentiment analysis, the robot can tell whether a message is sad or happy and, based on that information, compose an appropriate response.

    3. This would be really helpful for natural-sounding responses, but you could also use it for more meaningful understanding of requests. For example, a robot could ask "Would you like some cookies?" and get the response "I'm not very hungry" instead of a plain yes or no. Sentiment analysis would be helpful in determining whether phrases that do not directly answer a question are affirmative or negative, which could probably make interacting with a robot smoother.

  2. When computing the probability of classifying a new phrase into one of the five classes, the equation uses softmax instead of the argmax we usually use in models like POMDPs. Why softmax?

    1. The output of argmax is the _argument_ that produces the max, while softmax behaves more like a smooth version of max. Specifically, softmax exponentiates each score and normalizes by the sum of the exponentials, so the scores become a probability distribution (http://en.wikipedia.org/wiki/Softmax_activation_function, especially the section Smooth Approximation of Maximum).

      Softmax's primary advantage over the max function is its differentiability.
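      To make the difference concrete, here is a minimal sketch (the class scores are made up):

        import numpy as np

        def softmax(scores):
            # Exponentiate and normalize; subtracting the max keeps it numerically stable.
            exps = np.exp(scores - np.max(scores))
            return exps / exps.sum()

        scores = np.array([1.2, 0.3, -0.5, 2.0, 0.1])   # hypothetical scores for the 5 classes
        probs = softmax(scores)          # a full probability distribution over the classes
        winner = int(np.argmax(scores))  # argmax instead returns only the winning index (3 here)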

  3. This comment has been removed by the author.

  4. When initializing the word matrices of the RNN, why do we need to add a small amount of Gaussian noise? Another question: to my understanding, the phrases are obtained in the same way as n-grams. Is that right? Are they the same thing, or are they different?

    1. The matrices are initialized as X = I + epsilon, i.e., the identity plus a small amount of Gaussian noise. Although the MV-RNN paper doesn't discuss it in detail, I think the reason is to keep the values from being exactly zero (or identical), so they aren't meaningless in later calculations.
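      A minimal sketch of that initialization (the noise scale here is a guess, not a value from the paper):

        import numpy as np

        def init_word_matrix(d, noise_std=0.01):
            # X = I + epsilon: the identity plus a small amount of Gaussian noise.
            return np.eye(d) + noise_std * np.random.randn(d, d)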

  5. Q1: Why is this called 'deep' learning?

    Q2: How exactly does the tensor product work? Why does adding this have a big impact? The paper claims that "the tensors can directly relate input vectors", but why does this give the desired effect (answering this question from the paper: "Can a single, more powerful composition function perform better and compose aggregate meaning from smaller constituents more accurately than many input specific ones?") and how does this correspond to the desired composition?

    1. Deep learning is basically neural networks rehashed, except that at each level of the network you do some sort of unsupervised classification (clustering the output features at that level for similar inputs somehow). This seems to improve the results at the output node significantly (I do not know why). I can't see where they are doing this unsupervised step in this paper.
      The tensor product itself is this huge function that allows two input words to interact directly at different levels (at V^(1), then at V^(2), ... up to V^(d)). This means you have many new features instead of just the few you had for the MV-RNN. They learn this big V tensor for the pairs of data they see and improve it using backprop on the output. It seems to beat the MV-RNN by using a single huge tensor instead of many small matrices, one per word. The tensor allows more mixing between the words.
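      For concreteness, here is a rough sketch of the composition being described, p = f([a;b]^T V^[1:d] [a;b] + W [a;b]); the shapes follow the paper, but the code and names are mine:

        import numpy as np

        def rntn_compose(a, b, V, W):
            # a, b: child vectors of length d; V: tensor of shape (d, 2d, 2d); W: matrix of shape (d, 2d).
            ab = np.concatenate([a, b])                                    # stacked children, length 2d
            tensor_term = np.array([ab @ V[i] @ ab for i in range(V.shape[0])])
            return np.tanh(tensor_term + W @ ab)                           # parent vector of length d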

    2. So deep learning learns a multi-layer abstraction: lower-level concepts combine to represent higher-level concepts. For example, the Google Brain team learned the high-level concept 'cat' from images in YouTube videos. This paper only uses one layer of representation, and it does unsupervised learning on this layer.

  6. It is nice that the RNTN model can accurately classify the positive/negative sentiment of phrases. The handling of negated positive/negative phrases is especially nice.

    Troubles:
    I don’t really get how the learning is done (what is going on in section 4.4?).
    I am having trouble picturing what a tensor is. It looks like a kind of matrix...

    Softmax:
    http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression has a nice explanation of softmax regression that was used in the paper.

    1. You can look at their tutorial at http://nlp.stanford.edu/courses/NAACL2013/ which shows what the tensor looks like. The learning is based on backprop for sure, but I can't understand equation (2) for E(theta). Using backprop they are learning each parameter within each matrix of the large tensor using simple gradient descent rules.
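      Concretely, the tensor is just a stack of d slices, each an ordinary 2d x 2d matrix; a tiny illustration with made-up sizes:

        import numpy as np

        d = 4                                        # tiny feature size, purely for illustration
        V = 0.01 * np.random.randn(d, 2 * d, 2 * d)  # d slices, each 2d x 2d
        print(V.shape)      # (4, 8, 8)
        print(V[0].shape)   # (8, 8): one slice; every entry is a parameter that backprop updates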

    2. The learning is done using RNNs and deep learning methods; the deep learning methods try to combine the different components of the system. In section 4.4, the researchers assign different values to estimate the probability of each branch and compute the backprop contribution at each diverging node.

  7. Q1: not really answerable by us I think, but I'd like to see examples where the system fails. What are the failure cases?

    Q2: Somewhat similar to Dave's question - I feel a little bit fuzzy on the precise meaning of everything in the tensor pipeline. One specific question: they list a problem with the MV model as the explosion in the number of parameters (it seems to be d x d + d per node, plus its dependence on vocab size). But then V seems to be 2d x 2d x d, and similarly dependent on vocab size.

    (sub Q2): Just confirming my understanding, but isn't d the size of the vocabulary? I wonder what their specific number was. Unless I misunderstand, it seems like those matrices would be really, really large.

    1. |V| is the size of the vocabulary; d is the size of the feature vector you pick. Since a vector of size 2d sits on each side of V^(i), you have 2d x 2d matrices, d of them, making a larger tensor. I have posted my understanding of the other bit as a reply to Dave's question. It is odd that Q1 is not answered in the paper. There is their larger presentation at http://nlp.stanford.edu/courses/NAACL2013/ if you want.
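      To put rough numbers on it (both sizes below are made up, not the paper's values): in the MV-RNN every word carries its own d x d matrix plus a d-vector, while the RNTN shares one tensor across the whole vocabulary.

        d, vocab_size = 30, 20000                       # hypothetical feature size and vocabulary
        mvrnn_word_params = vocab_size * (d * d + d)    # grows with the vocabulary: 18,600,000
        rntn_tensor_params = d * (2 * d) * (2 * d)      # one shared tensor: 108,000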

    2. The comments in the live demo page (http://nlp.stanford.edu:8080/sentiment/rntnDemo.html) contain several failure cases, some of them quite interesting.

  8. Why does the RNTN account so well for the Negated Negative and Negated Positive examples, while the MV-RNN, which is very close to the accuracy of the RNTN on most other examples, fails miserably? It's fairly easy to see why bag of words models would fail on these cases, but I'm not as clear as to why some of the more complex models fail.

    1. The difference seems to be the composition function: the RNTN has the tensor, which the MV-RNN does not. The MV-RNN just uses matrix operations with a final softmax for classification, while the tensor in the RNTN better captures the influence of the children nodes on the parent. The tensor is essentially a richer, matrix-stack representation of how word meanings compose.
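      To make the contrast concrete, here is a rough sketch of the MV-RNN composition as I understand it (each word carries both a vector and a matrix), to compare against the single tensor-based function of the RNTN:

        import numpy as np

        def mvrnn_compose(a, A, b, B, W, W_M):
            # a, b: word vectors (d,); A, B: word matrices (d, d); W, W_M: matrices of shape (d, 2d).
            p = np.tanh(W @ np.concatenate([B @ a, A @ b]))  # parent vector: each matrix modifies its neighbor
            P = W_M @ np.vstack([A, B])                      # parent matrix, used in later compositions
            return p, P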

  9. I am also confused about the accuracy gap between the MV-RNN and the RNTN: what makes this gap so large even though the MV-RNN has higher space complexity? Also, I tested 'the movie is not bad' and 'the movie is not too bad' in the live demo. I am not sure whether the design handles 'too', but I guess not, because the sentiment classification of 'too' is neutral. However, the classifications of the nodes 'too bad' and 'is not too bad' look a little weird: why does 'too bad' tend to be neutral but 'is not too bad' tend to be negative? How does the system do that?

    1. Interestingly, "it is not bad" comes out slightly favorable, and "it is not too bad" comes out slightly negative. These phrases are heavily dependent on context and tone in conversational language, though, and it is doubtful that you would see them by themselves. As I understand it, "it is not too bad" is basically a less severe version of "it is bad". However, these latter two sentences aren't significantly different in their results; they are both just negative. I would expect "it is bad" to come out very negative.

  10. Is this algorithm useful for differentiating other aspects of language? Sarcasm or stubbornness are too contextual for short phrases, but perhaps the algorithm would be able to detect bitterness or humor? I also wonder how this could expand into categorizing sounds around language, like pauses, which could be added into the parse trees.

    1. Yeah I think that's a nice extension of the algorithm - it certainly appears as though other basic qualitative statements about utterances should also be inferable.

      I'm curious about the general problem of incorporating phonetic emphasis into textual natural language via some annotations/symbolic markup (like trop/cantillation in singing the Torah). With this information (and things like pauses, as you suggest), I think we could start to infer a pretty long list of qualities of textual utterances.

    2. I am working on the irony-detection project. Currently we are building a corpus for the project. It seems that you need contextual information along with the phrases to detect irony; I would say that no matter what model you are using, you need contextual information to detect irony.

    3. Humor might be interesting to think about, but it's so much more subtle than sentiment that I'm not sure the same approach would be useful. We can assign individual words some sentiment values and build up, but humor doesn't really emerge until you look at everything. I'm taking a class on Twain, so we talk a lot about where the humor comes from in some of his writing, and a lot of it has to do with how it's delivered and what background you have coming in. I feel like humor recognition would require some kind of model of the user's understanding, because humor usually requires a jolt from what you expect to happen in a given context to what actually ends up being said; the fact that things don't line up with your expectation is a source of humor. So it's not really just a feature of words, and I don't think this approach would be very applicable.

  11. 1) How is Eq 2 working? It does not look like it comes directly from the KL divergence but has some other form, and the parameters lambda and theta have not been discussed.
    2) Most of the backprop section talks about improving the model itself; where does it talk about improving the word vectors, which were chosen randomly? Or am I understanding this completely wrong?

    1. 2) The inference step finds "optimal" values for the parameters, theta = (V, W, W_s, L). L, the matrix of word vectors, is set to "optimal" values during the inference. I cannot tell you more than this because I don't really understand the inference section.
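      For what it's worth, the error being minimized has the usual regularized cross-entropy shape, summed over all labeled nodes; a sketch (the lambda value is a placeholder, not from the paper):

        import numpy as np

        def error(targets, predictions, theta_flat, lam=1e-4):
            # targets, predictions: lists of 5-class distributions, one pair per labeled node.
            cross_entropy = -sum(np.dot(t, np.log(y)) for t, y in zip(targets, predictions))
            return cross_entropy + lam * np.dot(theta_flat, theta_flat)   # L2 penalty on all parameters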

    2. They mention learning W and V but not L, which was promised at the start of the paper.
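      For reference, L is just the word-embedding matrix, one column per vocabulary word; a sketch with made-up sizes (since L is part of theta, backprop would also update the column used for each word):

        import numpy as np

        d, vocab_size = 30, 20000                   # hypothetical sizes
        L = 0.01 * np.random.randn(d, vocab_size)   # one d-dimensional column per word
        word_index = 42                             # hypothetical index of some word in the vocabulary
        a = L[:, word_index]                        # the word's vector, fed into the tree compositions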

  12. Q1: I'm also confused about the tensor. What are the slices inside a tensor? The paper also says the composition function for all nodes is tensor-based and is the same. So are the slices in the tensor the same too?

    Q2: I'm confused about the tensor product equation (h_i). The first matrix of h_i is 1 x 2d, the second matrix (V^[i]) of h_i is d x d, and the third matrix is 2d x 1. How could they multiply each other? The dimensions of the matrices don't match.

    Q3: Why does the RNTN perform so well on negating sentences? How does the tensor impact these types of sentences?

  13. I am also confused about the parameter d. It seems to be the number of features, but what is an appropriate value for it? Does it have something to do with the size of the available data set? And how would the value of d affect the performance of the model?

    1. Good question, I went back through the paper to try and find out, but I don't see where they give a value for d either. They define d as the dimension of the feature vector, but do not give a value.

    2. With the given input, the optimal performance was achieved at word vector sizes between 25 and 35 dimensions.

  14. The approach seems very interesting since it goes beyond just using a bag-of-words model and actually uses sentence structure to make sentiment judgements.
    My question is also similar to the questions above: how can positively or negatively labeled sentences help human-robot interaction? To me, it seems more helpful for long-term interaction than for short-term interaction.
    Like many people above, I am also confused about the tensor and how it can be separated, which seems important since in the learning phase the slices are learned separately.

  15. I really want to know how well humans do on the same task, and how much agreement they saw between the 3 judges. 85% seems reasonably good, but are the 15% really ambiguous? Or does a human always get them right?

    1. I was wondering this also, especially considering the amount of training time (I think they said 3 to 5 hours) and the complexity of the system involved (I can't say I really grok most of the mathematics involved here...).

  16. Not really a direct question, but this is a pretty complicated system, and in the best case we end up about 10% better than a simple-minded vector-averaging approach, so was it all worth it?

    Sentiment analysis is useful but a pretty specific problem with a binary or single-dimensional answer. Does this approach generalize to other language questions like ones relevant to grounding meaning beyond vague concepts like sentiment?

    1. You sort of have to look at it the opposite way: how many cases are they not getting correct? As state-of-the-art systems get closer to 100%, the gains become more incremental, but they are still important because they can significantly reduce the rate of failure. Increasing effectiveness from 80% to 85% only gets you about 6% more successes, but it decreases the number of failures by 25%.
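      A quick check of that arithmetic:

        old_acc, new_acc = 0.80, 0.85
        more_successes = new_acc / old_acc - 1               # ~0.06, i.e., about 6% more successes
        fewer_failures = 1 - (1 - new_acc) / (1 - old_acc)   # 0.25, i.e., 25% fewer failures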

      I think it would be difficult to get generalized actions and meanings out of it. It required a lot of training data that solely mapped sentences to positive and negative sentiment. What kinds of labels could you put on training data, with enough examples, for a system like this to map to generalized meanings?

    2. I was thinking along the same lines: why go to all the trouble of building a large corpus exclusively on movie reviews? I suppose it must generalize to some extent, I just don't see any reason not to throw in some other kinds of text. I'm probably missing something.

  17. I'm not sure if this is the right thing to post -- I just had a terrible time trying to walk myself through the mathematics in this paper. I found that I was able to understand the general idea. The system uses a very fine-grained dataset of correctly parsed sentences annotated with parse trees and sentiment information in order to classify new statements. I'm unfamiliar with the motivation and general idea of "neural" systems and I'm wondering if anyone has some good resources with which to start?
