Tuesday, November 26, 2013

Tree Substitution Grammars

The reading for Tuesday 12/3 will cover tree substitution grammars:
  • Cohn, Trevor, Sharon Goldwater, and Phil Blunsom. Inducing compact but accurate tree-substitution grammars. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2009.

By Sunday 12/1 at 7pm, please post a question about the paper.  By 7pm on Monday 12/2, post an answer to somebody else's question.

40 comments:

  1. 1. In the Figure 1 example derivations for the same tree, the V->V and NP->NP rules only generate a single tag as the child tree, and the probability is 1. Why do we need those rules? What if we derive from NP->George and V->hates directly?
    2. On page 550, P0 generates each elementary tree by a series of random decisions, and the Gibbs sampler uses a random sampling order. Why do we need randomness in the algorithm?

    Replies
    1. I think the example is pretty trivial, but aren't there generally multiple ways of parsing complicated sentences? Evaluating whether a sentence makes sense under different parse trees requires considering different organizations of trees.

      To answer your second question, I think that expanding the parse tree for all possible parses becomes intractably large, so exploring every possible node is impossible. However, using a random sampling method with a bias toward the tree parses that are used frequently in the training data helps ensure that the algorithm will generally converge toward the optimal solution without getting stuck at any one node.

  2. Does the simultaneous training of hyperparameters lead to large overfitting problems? The authors seem to think it doesn't have a large impact, but it seems like you generally want to avoid learning hyperparameters at the same time as the model parameters. Is there evidence that this isn't really a large problem? They chose to make the tradeoff because determining the hyperparameters by hand is time consuming, but I'm wondering what the overall impact of this is.

  3. They sample the hyperparameters, alpha and beta, because selecting them is difficult. But how do they choose the parameters (0.001, 1000) for the Gamma distribution and (1, 1) for the Beta distribution? Isn't selecting these parameters just as difficult as selecting alpha and beta?

    Does sampling hyperparameters lead to better performance than hand-picking them?

    Replies
    1. I do not think there is a good way to pick them at all. This has to be the easiest way to do it.

    2. Plot Gamma(.001, 1000) and Beta(1,1)

      http://www.wolframalpha.com/input/?i=gamma+distribution+.001+1000

      http://www.wolframalpha.com/input/?i=beta+distribution+1+1

      Note that these distributions are very wide, i.e. they have large entropy/uncertainty and are mathematically close to the uniform distribution. (Beta(1,1) _is_ actually the uniform.)

      Basically, they tell the computer: alpha and beta might be anything; go and figure out what you think the best values are.

      To the question of whether selecting these parameters is just as difficult, the answer is no; selecting priors on hyperparameters is a fair bit easier. For one, you almost always want to use a uniform or similarly high-entropy distribution, and second, this prior does not affect the converged-to value, only the rate of convergence. So you'll still _get_ whatever the optimal hyperparameter is (or at least a local maximum), you'd just have to wait longer for it. Contrast that with selecting the hyperparameters themselves, where you might pick them correctly or you might pick them incorrectly.
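
      If it helps to see how vague these priors are, here is a minimal SciPy sketch (my own, not from the paper, and assuming the shape/scale parameterisation of the Gamma):

        # Sketch only: checking how diffuse Gamma(0.001, 1000) and Beta(1, 1) are.
        from scipy import stats

        gamma_prior = stats.gamma(a=0.001, scale=1000.0)
        print(gamma_prior.mean())   # shape * scale = 1.0
        print(gamma_prior.var())    # shape * scale^2 = 1000.0, i.e. extremely spread out

        beta_prior = stats.beta(a=1.0, b=1.0)    # Beta(1, 1) is exactly Uniform(0, 1)
        print(beta_prior.pdf([0.1, 0.5, 0.9]))   # [1. 1. 1.] -- flat density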

  4. I don't quite understand why there's an inconsistency problem and how the paper resolves it. Also, in the experiments section, MPP is not systematically better than MPD. Why do they conjecture that the reason is the sample variance of long sentences and rarely getting repeated samples of the same tree? How do these affect the result?

    Replies
    1. My understanding is that previous approaches learn some rules e from t even though e is not a part of t. They resolve this problem by having the likelihood p(t|e) be 0 when e is not a part of t. The inconsistent rules get assigned 0 probability. Then their model only learns the consistent rules.

  5. I'm confused about page 550: why do the authors say they can generate "ei" by drawing from a cache of former expansions of c, and use the number of times an expansion was used before to calculate the probability? If they calculate the probability this way, the result can be any number between 0 and 1 rather than either 1 or 0.

    Replies
    1. They are talking about the probability distribution (2), which is between 0 and 1. The likelihood p(t|e) is either 0 or 1.
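
      For intuition, here is a rough sketch (my own simplified notation, not code from the paper) of the cache-style predictive probability in the general form that distribution (2) takes: reusing an expansion becomes more likely the more often it has been used, interpolated with the base distribution P0.

        from collections import defaultdict, Counter

        counts = defaultdict(Counter)   # counts[c][e] = times elementary tree e has expanded c

        def cache_prob(e, c, alpha, p0):
            # P(e | c, previous expansions): reuse e in proportion to its count,
            # or fall back on the base distribution P_0(e | c) with weight alpha.
            n_c = sum(counts[c].values())
            return (counts[c][e] + alpha * p0(e, c)) / (n_c + alpha)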

  6. In their discussion of the full treebank on page 554 they mention that their model falls below the Berkeley parser (state of the art) because the Berkeley parser has been tuned for the Penn Treebank specifically. They also say that they anticipate being able to improve their results by tuning the model and preparing the data. My question is whether that is a good thing. I think it is a benefit that the PTSG works without having to prepare the data or be changed for each data set. How well does the Berkeley parser do on other data? Should the state of the art require data prep and tuning?

    Replies
    1. It's interesting that they left the paper with a "parsing algorithm that could lead to state-of-the-art performance in the future" -- perhaps they agree with you.

    2. I also found that a bit odd. In order to do a true comparison I suppose they would need to perform similar data preparation, but I wish the comparison metric were more principled. Nakul (I believe) mentioned below that the Berkeley parser achieved 88%, which is actually quite a bit higher than the PTSG. I'm curious what the PTSG would achieve (purely for the sake of comparing to the Berkeley parser), but I definitely think that getting decent results without having to clean your data is a big win.

  7. I'm a bit confused as to what exactly distinguishes a CFG rule from a non-CFG rule, and what sort of non-CFG rules the model learns (i.e. I didn't find section 7 + figure 4 that enlightening). At the beginning of the paper they mention that CFGs suffer due to the independence assumptions enforced by their context-freeness - how do TSGs avoid being context free, and intuitively, what sort of rules should we expect to see as a result?

    Replies
    1. This is also what I was wondering. The examples given in section 7 show that the non-CFG rules tend to be longer or more complex. It seemed to me that section 7 was trying to point out complex linguistic rules which were being captured by the larger substitution trees, such as "multi-word units, subcategorisation frames, and lexical agreement". My guess is that CFGs can in fact represent these rules, but the given TSG model learns these linguistic rules by composing what would otherwise be many separate CFG rules. I also believe that when it mentions "CFG rules", it's describing typical grammatical rules (based solely on hard grammar) such as NP -> NN VP, and not complex language/speech-based rules.

    2. A CFG works one level at a time, meaning every rule maps a non-terminal to some string of non-terminals and terminals. An example of this would be: S -> N VP (start symbol to a noun and a verb phrase). A TSG allows for more depth, such as: S -> N (said S to N). While it may seem that these two can be made equivalent (which I think they can; you could simply pull all the pieces up to the same level, giving a CFG rule of the form S -> N said S to N), this destroys the grammar tree by removing its tree-ness and making it a single layer, rather than maintaining the two-branch tree that grammar trees have. TSGs maintain the two-branch grammar tree structure while still capturing word dependencies more than one level down.
      Figure 7 shows the multi-level structure. Each replacement of a non-terminal has two sub-branches (represented by the sets of parens), but inside each of those branches some other structure is often stored as well. For example, the adjective-phrase non-terminal often maps to [(adverb too) adjective], rather than just two non-terminals, which could fail to encode the prevalence of 'too' in this grammar.
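
      To make the depth difference concrete, here is a toy sketch (my own representation, not the paper's data structures) of a one-level CFG rule versus a multi-level TSG elementary tree:

        # Trees as (label, children) tuples; bare strings are open frontier non-terminals or words.
        cfg_rule = ("S", ["N", "VP"])    # one level only: S -> N VP

        # TSG elementary tree for roughly S -> N (VP (V said) S):
        # the word "said" and two levels of structure are fixed inside one rule,
        # while the frontier non-terminals N and S remain open for substitution.
        tsg_elementary_tree = ("S",
                               ["N",
                                ("VP", [("V", ["said"]), "S"])])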

    3. I think TSGs are weakly equivalent to CFGs, i.e. they are both capable of producing the same string languages. The difference, as I understand it, is that TSGs are not strongly equivalent, i.e. they can generate a different set of structures from a CFG. (That said, the paper and the internet disagreed on some of these definitions so I may be confused.)

      My intuition here is that a TSG can represent grammars which would take many (an exponential number of?) terms in a CFG. It makes sense to me that this would be more expressive & therefore easier to learn, but I wish I understood the implications more precisely.

  8. In a way, this seems to be a method for handling CFGs with larger-scale rules, i.e. tree substitutions. Beyond this difference, how else does the algorithm the paper describes differ fundamentally from CFGs, or is my first sentence (broadly) correct?

    Replies
    1. I believe you are correct in that a PTSG is very similar to a CFG. In the introduction, they describe a PTSG as "an extension to the PCFG in which non-terminals can rewrite as entire tree fragments."

    2. As I understand it, the difference between P/CFG and P/TSG is that a CFG can only look at a single "node"<->"subtree root" connection in a tree (hence, context-free), whereas a TSG can look at more than one connection or level of the tree at a time (hence, tree substitution). For example, say a subject wants a VP (verb phrase) as the rest of the sentence; then any tree with VP at the root could plug into that according to the CFG model. However, the TSG model, since it operates at multiple levels, could take any tree with VP at the root but also guarantee that it contains a PP (prepositional phrase).

      I think this means that P/TSG should be strictly better than P/CFG.

    3. In my opinion, a PTSG is more like a PCFG. A PCFG is a probabilistic version of a CFG in which each production has a probability. A PTSG is an extension of the PCFG in which nonterminals can rewrite as entire tree fragments, not just immediate children.

    4. A PTSG is based on a PCFG in which nonterminals can rewrite as entire tree fragments, as mentioned in the introduction. Moreover, these large fragments can be used to encode non-local context. They need to induce the PTSG productions and their probabilities in an unsupervised way from an ordinary treebank, because no annotated data providing TSG derivations is available.

  9. Like Lauren, I wanted more information about how they would tune this parser to work on the Penn Treebank better. How are the other parsers tuned to take advantage of that dataset? They also suggested improvements could be made to match the state of the art parsers, but I missed what those improvements could have been and how they would've improved the parsing. What situations would their tuned algorithms solve, and how much of a percentage boost would they expect with that dataset?

    Replies
    1. I think this is an important point; it is difficult to compare this parser to other highly tuned ones (i.e. the Berkeley parser). I feel that tuning it to the Penn Treebank would give a better idea of this parser's relative performance.

  10. I'll pose two questions, one specific to the paper, and another more general about parsing.
    1. In Table 1, the PCFG is shown to have 3500 rules, and the P_0^M TSG to have 6609 rules. However, in the Discussion (section 7), they note that this TSG had 5611 CFG rules and 1008 TSG rules. Why can't the PCFG model capture more of the CFG rules that the TSG finds? Perhaps it's a moot point since the TSG discovers more rules, but it seems like a weird result.

    2. I wonder what inter-rater agreement is on the Penn Treebank, to calibrate expectations for an upper bound on parsing performance. In playing with the Stanford Parser throughout this semester [for another class] (correct me if I'm wrong, but I believe it uses a PCFG), I have yet to see a case where it's given me completely bogus results. Therefore, I'm wondering: how close are we to good-enough / as-good-as-can-be-hoped parsing performance, and how much do we actually gain with this new parsing methodology? (I'm sure there are cases where the Stanford Parser messes up, but I haven't found them.) Moreover, when PCFG or TSG parsing fails, does it give not-quite-perfect-but-still-usable results, or completely bogus ones?

    Replies
    1. I vaguely remember inter-rater agreement being somewhere in the high 90s? (Which does make parsing more or less solved, I believe.) I think Eugene cited it in CS146. He would probably know the details, at any rate.

    2. How does that mean parsing is solved? If 77% is state-of-the-art according to this paper (granted, from 09), then there's still ~18 percentage points to go.

    3. I think they cite the Berkeley parser at 88%.

    4. About 1): the PTSG just seems like a nonparametric way to represent rules, and I believe they still have not figured out a better way to train this form to get results. This is a much bigger space to train on, and hence more difficult to train; it can certainly represent more rules, but training it to recover exactly the rules that the CFG has is hard.

  11. They talk about using CYK to parse, but I don't know if I understand how exactly bottom-up parsing works for a tree substitution grammar. It seems like a very PCFG-centric algorithm. How does this actually work? They don't explain very much in this little section.

  12. In the paper, the authors don't make clear what the inconsistency problem is. What does "consistent" mean on page 550? "P(t|e) is either equal to 1 (when t and e are consistent) or 0 (otherwise)". Does it mean t and e are the same?

    I also don't understand how they compute the base distribution P_0^C. "P_0^C has a similar generative process but draws non-terminal expansions from a treebank-trained PCFG instead of a uniform distribution." So how does P_0^C draw expansions from a treebank-trained PCFG?

    Replies
    1. My understanding is that t and e are not the same. I think for a parse tree t, there could be multiple sequences of elementary trees e that generate the same t.

    2. I think 'consistent' means that you can derive t from e. If t is not derivable, the probability of t given e is 0; and since a derivation e fully determines the tree it produces, when t is derivable from e, P(t|e) is 1.
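
      Put another way, the likelihood is just an indicator function. A toy sketch in my own notation (not code from the paper), where yield_tree is assumed to perform the substitutions in e:

        def likelihood(t, e, yield_tree):
            # P(t | e) = 1 if composing the elementary trees in e yields exactly t, else 0.
            return 1.0 if yield_tree(e) == t else 0.0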

  13. I am also interested in their statement about the Penn Treebank. They state that "careful data preparation and model tuning could greatly improve our model’s performance". What kind of data preparation could be used, and are there any examples?

    Replies
    1. They didn't quite mention how they prepared P_0^C using a pre-learnt PCFG. Maybe they meant using differently tuned PCFG models and testing with different alpha and beta values.

  14. I'm confused about Figure 3 and section 6.1: "These trees cannot be modeled accurately with a CFG because expanding A and B terms into terminal strings requires knowing their parent's non-terminal." I'm not sure what they mean by accurately (the CFG can only generate a superset? the CFG requires many rules?), and I'm not sure I understand what they mean by "knowing their parent's non-terminal" at all.

    A small note: what does it mean for something to be accepted in Metropolis-Hastings?

    Replies
    1. I'm not sure yet exactly what the notation they used in the figure is supposed to mean. It just looks like a CFG. I found a review paper (Joshi) of TSGs/TAGs and I don't see this notation.

      But I think "knowing their parent's non-terminal" is because we end up with an internal node B which leads to terminals of a or b and the TSG is constructed so that a B which is a child of A is substituted with a B with children that are a's.

      So if you look at this as just a CFG you can't make that guarantee that B's children will be a's because it requires that context of the parent.
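
      A made-up example (mine, not the paper's exact grammar) of why the parent matters:

        # Suppose the treebank only ever contains these two subtree shapes:
        #   (A (B a) (B a))   -- when B's parent is A, B rewrites to "a"
        #   (C (B b) (B b))   -- when B's parent is C, B rewrites to "b"
        # A CFG must contain both B -> a and B -> b, so it also licenses the unseen
        # mixture (A (B a) (B b)); a TSG can store each two-level fragment whole.
        tsg_fragments = [
            ("A", [("B", ["a"]), ("B", ["a"])]),
            ("C", [("B", ["b"]), ("B", ["b"])]),
        ]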

  15. It seems to me that the model is novel in that it does not require predefined parameters for learning the parse rules; however, there are still two parameters remaining, alpha and beta, and it also requires a pre-learned PCFG for P_0^C for better performance. If so, I'd like to know what tuning has to be done in other models and how complicated those tuning processes are compared to the PTSG model presented in this paper. Also, if it needs a learned PCFG for P_0^C, the whole process doesn't seem like a simplification of the existing method.
    I didn't quite understand Figure 4. I do not know what the x-axis and y-axis mean. Could someone explain what these axes are?

  16. Because they are using hyperparameters in their model, I would be interested in seeing how the selection of hyperparameters affects their results. For example, how bad would it be if the learned hyperparameters ended up at a local optimum instead of a global one? In addition, the topic of human annotation agreement recently came up in class. I think it would be interesting to compare the accuracy of this model to that of a human, and not a "gold standard".

    Replies
    1. Human accuracy in comparing a parse tree to a sentence (or a paragraph) is about 90%, as mentioned in the paper.
