Tuesday, November 26, 2013

Tree Substitution Grammars

The reading for Tuesday 12/3 will cover tree substitution grammars:
  • Cohn, Trevor, Sharon Goldwater, and Phil Blunsom. Inducing compact but accurate tree-substitution grammars. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2009.

By Sunday 12/1 at 7pm, please post a question about the paper.  By 7pm on Monday 12/2, post an answer to somebody else's question.

40 comments:

  1. 1. In the Figure 1 example derivations for the same tree, the V->V and NP->NP rules only generate a single tag as the child tree, and the probability is 1. Why do we need those rules? What if we derive from NP->George and V->hates directly?
    2. On page 550, P0 generates each elementary tree by a series of random decisions, and the Gibbs sampler uses a random sampling order. Why do we need randomness in the algorithm?

    Replies
    1. I think the example is pretty trivial, but aren't there generally multiple ways of parsing complicated sentences? Evaluating whether a sentence makes sense under different parse trees requires considering different organizations of trees.

      To answer your second question, I think that expanding the parse tree for all possible parses becomes intractably large, so exploring every possible node is impossible. However, using a random sampling method with a bias toward the tree parses that are used frequently in the training data helps ensure that the algorithm will generally converge toward the optimal solution without getting stuck at any one node.

  2. Does the simultaneous training of hyperparameters lead to large overfitting problems? The authors seem to think it doesn't have a large impact, but it seems like you generally want to avoid learning hyperparameters at the same time as the model parameters. Is there evidence that this isn't really a large problem? They chose to make the tradeoff because determining the hyperparameters by hand is time consuming, but I'm wondering what the overall impact of this is.

  3. They sample the hyperparameters, alpha and beta, because selecting them is difficult. But how do they choose the parameters (0.001, 1000) for the Gamma distribution and (1, 1) for the Beta distribution? Isn't selecting these parameters just as difficult as selecting alpha and beta?

    Does sampling hyperparameters lead to better performance than hand-picking them?

    Replies
    1. I do not think there is a good way to pick them at all. This has to be the easiest way to do it.

    2. Plot Gamma(.001, 1000) and Beta(1,1)

      http://www.wolframalpha.com/input/?i=gamma+distribution+.001+1000

      http://www.wolframalpha.com/input/?i=beta+distribution+1+1

      Note that these distributions are very wide, i.e. they have large entropy/uncertainty and are mathematically close to the uniform distribution. (Beta(1,1) _is_ actually the uniform.)

      Basically, they tell the computer: alpha and beta might be anything; go and figure out what you think the best values are.

      To the question of whether selecting these parameters is just as difficult, the answer is no; selecting priors on hyperparameters is a fair bit easier. For one, you almost always want to use a uniform or similarly high-entropy distribution, and second, this prior does not affect the converged-to value, only the rate of convergence. So you'll still _get_ whatever the optimal hyperparameter is (or at least a local maximum), you'd just have to wait longer for it. Contrast that with selecting the hyperparameters themselves, where you might pick them correctly or you might pick them incorrectly.
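
      If it helps to see how vague these priors are, here is a minimal SciPy sketch (my own, not from the paper, and assuming the shape/scale parameterisation of the Gamma):

        # Sketch only: checking how diffuse Gamma(0.001, 1000) and Beta(1, 1) are.
        from scipy import stats

        gamma_prior = stats.gamma(a=0.001, scale=1000.0)
        print(gamma_prior.mean())   # shape * scale = 1.0
        print(gamma_prior.var())    # shape * scale^2 = 1000.0, i.e. extremely spread out

        beta_prior = stats.beta(a=1.0, b=1.0)    # Beta(1, 1) is exactly Uniform(0, 1)
        print(beta_prior.pdf([0.1, 0.5, 0.9]))   # [1. 1. 1.] -- flat density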

  4. I don't quite understand why there's an inconsistency problem and how the paper resolves it. Also, in the experiments section, MPP is not systematically better than MPD. Why do they conjecture that the reason is the sample variance of long sentences and rarely getting repeated samples of the same tree? How do these affect the result?

    Replies
    1. My understanding is that previous approaches learn some rules e from t even though e is not a part of t. They resolve this problem by having the likelihood p(t|e) be 0 when e is not a part of t. The inconsistent rules get assigned 0 probability. Then their model only learns the consistent rules.

  5. I'm confused about page 550: why do the authors say they can generate "ei" by drawing from a cache of former expansions of c, and use the number of times an expansion was used before to calculate the probability? If they calculate the probability this way, the result can be any number between 0 and 1 rather than either 1 or 0.

    Replies
    1. They are talking about the probability distribution (2), which is between 0 and 1. The likelihood p(t|e) is either 0 or 1.
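
      For intuition, here is a rough sketch (my own simplified notation, not code from the paper) of the cache-style predictive probability in the general form that distribution (2) takes: reusing an expansion becomes more likely the more often it has been used, interpolated with the base distribution P0.

        from collections import defaultdict, Counter

        counts = defaultdict(Counter)   # counts[c][e] = times elementary tree e has expanded c

        def cache_prob(e, c, alpha, p0):
            # P(e | c, previous expansions): reuse e in proportion to its count,
            # or fall back on the base distribution P_0(e | c) with weight alpha.
            n_c = sum(counts[c].values())
            return (counts[c][e] + alpha * p0(e, c)) / (n_c + alpha)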

  6. In their discussion of the full treebank on page 554 they mention that their model falls below the Berkeley parser (state of the art) because the Berkeley parser has been tuned for the Penn Treebank specifically. They also say that they anticipate being able to improve their results by tuning the model and preparing the data. My question is whether that is a good thing. I think it is a benefit that the PTSG works without having to prepare the data or be changed for each data set. How well does the Berkeley parser do on other data? Should the state of the art require data prep and tuning?

    Replies
    1. It's interesting that they left the paper with a "parsing algorithm that could lead to state-of-the-art performance in the future" -- perhaps they agree with you.

    2. I also found that a bit odd. In order to do a true comparison I suppose they would need to perform similar data preparation, but I wish the comparison metric were more principled. Nakul (I believe) mentioned below that the Berkeley parser achieved 88%, which is actually quite a bit higher than the PTSG. I'm curious what the PTSG would achieve (purely for the sake of comparing to the Berkeley parser), but I definitely think that getting decent results without having to clean your data is a big win.

  7. I'm a bit confused as to what exactly distinguishes a CFG rule from a non-CFG rule, and what sort of non-CFG rules the model learns (i.e. I didn't find section 7 + figure 4 that enlightening). At the beginning of the paper they mention that CFGs suffer due to the independence assumptions enforced by their context-freeness - how do TSGs avoid being context free, and intuitively, what sort of rules should we expect to see as a result?

    Replies
    1. This is also what I was wondering. The examples given in section 7 show that the non-CFG rules tend to be longer or more complex. It seemed to me that section 7 was trying to point out complex linguistic rules which were being captured by the larger substitution trees, such as "multi-word units, subcategorisation frames, and lexical agreement". My guess is that CFGs can in fact represent these rules, but the given TSG model learns these linguistic rules by composing what would otherwise be many separate CFG rules. I also believe that when it mentions "CFG rules", it's describing typical grammatical rules (based solely on hard grammar) such as NP -> NN VP, and not complex language/speech-based rules.

    2. A CFG works one level at a time, meaning every rule maps a non-terminal to some string of non-terminals and terminals. An example of this would be: S -> N VP (start symbol to a noun and a verb phrase). A TSG allows for more depth, such as: S -> N (said S to N). While it may seem that these two can be made equivalent (which I think they can; you could simply pull all the pieces up to the same level, giving a CFG rule of the form S -> N said S to N), this destroys the grammar tree by removing its tree-ness and making it a single layer, rather than maintaining the two-branch tree that grammar trees have. TSGs maintain the two-branch grammar tree structure while still capturing word dependencies more than one level down.
      Figure 7 shows the multi-level structure. Each replacement of a non-terminal has two sub-branches (represented by the sets of parens), but inside each of those branches some other structure is often stored as well. For example, the adjective-phrase non-terminal often maps to [(adverb too) adjective], rather than just two non-terminals, which could fail to encode the prevalence of 'too' in this grammar.
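
      To make the depth difference concrete, here is a toy sketch (my own representation, not the paper's data structures) of a one-level CFG rule versus a multi-level TSG elementary tree:

        # Trees as (label, children) tuples; bare strings are open frontier non-terminals or words.
        cfg_rule = ("S", ["N", "VP"])    # one level only: S -> N VP

        # TSG elementary tree for roughly S -> N (VP (V said) S):
        # the word "said" and two levels of structure are fixed inside one rule,
        # while the frontier non-terminals N and S remain open for substitution.
        tsg_elementary_tree = ("S",
                               ["N",
                                ("VP", [("V", ["said"]), "S"])])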

    3. I think TSGs are weakly equivalent to CFGs, i.e. they are both capable of producing the same string languages. The difference, as I understand it, is that TSGs are not strongly equivalent, i.e. they can generate a different set of structures from a CFG. (That said, the paper and the internet disagreed on some of these definitions so I may be confused.)

      My intuition here is that a TSG can represent grammars which would take many (an exponential number of?) terms in a CFG. It makes sense to me that this would be more expressive & therefore easier to learn, but I wish I understood the implications more precisely.

  8. In a way, this seems to be a method for handling CFGs with larger-scale rules, i.e. tree substitutions. Beyond this difference, how else does the algorithm the paper describes differ fundamentally from CFGs, or is my first sentence (broadly) correct?

    Replies
    1. I believe you are correct in that a PTSG is very similar to a CFG. In the introduction, they describe a PTSG as "an extension to the PCFG in which non-terminals can rewrite as entire tree fragments."

    2. As I understand it, the difference between P/CFG and P/TSG is that a CFG can only look at a single "node"<->"subtree root" connection in a tree (hence, context-free), whereas a TSG can look at more than one connection or level of the tree at a time (hence, tree substitution). For example, say a subject wants a VP (verb phrase) as the rest of the sentence; then any tree with VP at the root could plug into that according to the CFG model. However, the TSG model, since it operates at multiple levels, could take any tree with VP at the root but also guarantee that it contains a PP (prepositional phrase).

      I think this means that P/TSG should be strictly better than P/CFG.

    3. In my opinion, a PTSG is more like a PCFG. A PCFG is a probabilistic version of a CFG in which each production has a probability. A PTSG is an extension of the PCFG in which nonterminals can rewrite as entire tree fragments, not just immediate children.

    4. A PTSG is based on a PCFG in which nonterminals can rewrite as entire tree fragments, as mentioned in the introduction. Moreover, these large fragments can be used to encode non-local context. They need to induce the PTSG productions and their probabilities in an unsupervised way from an ordinary treebank, because no annotated data providing TSG derivations is available.

  9. Like Lauren, I wanted more information about how they would tune this parser to work on the Penn Treebank better. How are the other parsers tuned to take advantage of that dataset? They also suggested improvements could be made to match the state of the art parsers, but I missed what those improvements could have been and how they would've improved the parsing. What situations would their tuned algorithms solve, and how much of a percentage boost would they expect with that dataset?

    Replies
    1. I think this is an important point; it is difficult to compare this parser to other highly tuned ones (i.e. the Berkeley parser). I feel that tuning it to the Penn Treebank would give a better idea of this parser's relative performance.

  10. I'll pose two questions, one specific to the paper, and another more general about parsing.
    1. In Table 1, the PCFG is shown to have 3500 rules, and the P_0^M TSG to have 6609 rules. However, in the Discussion (section 7), they note that this TSG had 5611 CFG rules and 1008 TSG rules. Why can't the PCFG model capture more of the CFG rules that the TSG finds? Perhaps it's a moot point since the TSG discovers more rules, but it seems like a weird result.

    2. I wonder what inter-rater agreement is on the Penn Treebank, to calibrate expectations for an upper bound on parsing performance. In playing with the Stanford Parser throughout this semester [for another class] (correct me if I'm wrong, but I believe it uses a PCFG), I have yet to see a case where it's given me completely bogus results. Therefore, I'm wondering: how close are we to good-enough / as-good-as-can-be-hoped parsing performance, and how much do we actually gain with this new parsing methodology? (I'm sure there are cases where the Stanford Parser messes up, but I haven't found them.) Moreover, when PCFG or TSG parsing fails, does it give not-quite-perfect-but-still-usable results, or completely bogus ones?

    Replies
    1. I vaguely remember inter-rater agreement being somewhere in the high 90s? (Which does make parsing more or less solved, I believe.) I think Eugene cited it in CS146. He would probably know the details, at any rate.

    2. How does that mean parsing is solved? If 77% is state-of-the-art according to this paper (granted, from 09), then there's still ~18 percentage points to go.

    3. I think they cite the Berkeley parser at 88%.

    4. About 1): the PTSG just seems like a nonparametric way to represent rules, and I believe they still have not figured out a better way to train this form to get results. This is a much bigger space to train on, and hence more difficult to train; it can certainly represent more rules, but training it to recover exactly the rules that the CFG has is hard.

  11. They talk about using CYK to parse, but I don't know if I understand how exactly bottom-up parsing works for a tree substitution grammar. It seems like a very PCFG-centric algorithm. How does this actually work? They don't explain very much in this little section.

  12. In the paper, the authors don't make clear what the inconsistency problem is. What does "consistent" mean on page 550? "P(t|e) is either equal to 1 (when t and e are consistent) or 0 (otherwise)". Does it mean t and e are the same?

    I also don't understand how they compute the base distribution P_0^C. "P_0^C has a similar generative process but draws non-terminal expansions from a treebank-trained PCFG instead of a uniform distribution." So how does P_0^C draw expansions from a treebank-trained PCFG?

    Replies
    1. My understanding is that t and e are not the same. I think for a parse tree t, there could be multiple sequences of elementary trees e that generate the same t.

    2. I think 'consistent' means that you can derive t from e. If t is not derivable, the probability of t given e is 0; and since a derivation e fully determines the tree it produces, when t is derivable from e, P(t|e) is 1.
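
      Put another way, the likelihood is just an indicator function. A toy sketch in my own notation (not code from the paper), where yield_tree is assumed to perform the substitutions in e:

        def likelihood(t, e, yield_tree):
            # P(t | e) = 1 if composing the elementary trees in e yields exactly t, else 0.
            return 1.0 if yield_tree(e) == t else 0.0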

  13. I am also interested in their statement about the Penn Treebank. They state that "careful data preparation and model tuning could greatly improve our model’s performance". What kind of data preparation could be used, and are there any examples?

    Replies
    1. They didn't quite mention how they prepared P_0^C using a pre-learnt PCFG. Maybe they meant using differently tuned PCFG models and testing with different alpha and beta values.

  14. I'm confused about Figure 3 and section 6.1: "These trees cannot be modeled accurately with a CFG because expanding A and B terms into terminal strings requires knowing their parent's non-terminal." I'm not sure what they mean by accurately (the CFG can only generate a superset? the CFG requires many rules?), and I'm not sure I understand what they mean by "knowing their parent's non-terminal" at all.

    A small note: what does it mean for something to be accepted in Metropolis-Hastings?

    Replies
    1. I'm not sure yet exactly what the notation they used in the figure is supposed to mean. It just looks like a CFG. I found a review paper (Joshi) of TSGs/TAGs and I don't see this notation.

      But I think "knowing their parent's non-terminal" is because we end up with an internal node B which leads to terminals of a or b and the TSG is constructed so that a B which is a child of A is substituted with a B with children that are a's.

      So if you look at this as just a CFG you can't make that guarantee that B's children will be a's because it requires that context of the parent.
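
      A made-up example (mine, not the paper's exact grammar) of why the parent matters:

        # Suppose the treebank only ever contains these two subtree shapes:
        #   (A (B a) (B a))   -- when B's parent is A, B rewrites to "a"
        #   (C (B b) (B b))   -- when B's parent is C, B rewrites to "b"
        # A CFG must contain both B -> a and B -> b, so it also licenses the unseen
        # mixture (A (B a) (B b)); a TSG can store each two-level fragment whole.
        tsg_fragments = [
            ("A", [("B", ["a"]), ("B", ["a"])]),
            ("C", [("B", ["b"]), ("B", ["b"])]),
        ]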

  15. It seems to me that the model is novel in that it does not require predefined parameters for learning the parse rules; however, there are still two parameters remaining, alpha and beta, and it also requires a pre-learned PCFG for P_0^C for better performance. If so, I'd like to know what tuning has to be done in other models and how complicated those tuning processes are compared to the PTSG model presented in this paper. Also, if it needs a learned PCFG for P_0^C, the whole process doesn't seem like a simplification of the existing method.
    I didn't quite understand Figure 4. I do not know what the x-axis and y-axis mean. Could someone explain what these axes are?

  16. Because they are using hyperparameters in their model, I would be interested in seeing how the selection of hyperparameters affects their results. For example, how bad would it be if the learned hyperparameters ended up at a local optimum instead of a global one? In addition, the topic of human annotation agreement recently came up in class. I think it would be interesting to compare the accuracy of this model to that of a human, and not a "gold standard".

    Replies
    1. Human accuracy in comparing a parse tree to a sentence (or a paragraph) is about 90%, as mentioned in the paper.
