Statistical models of grammar induction, by which I mean models that estimate a probability distribution over both syntactic analyses and word strings after observing only word strings, are typically a type of model called a *generative model*. This notion of "generative" is actually pretty similar to the notion of "generative" in modern syntax: there is an inventory of re-usable substructures or pieces (like context-free grammar rules, which are really just local subtrees), and structures and word strings are produced by selecting pieces from your inventory. The primary difference from standard syntax is that each choice is associated with a probability score.

Now, there's a little bit of a technical wrinkle here, and I was wondering if the other commenters here had any reactions. The thing that makes these models "generative" in the statistical sense is that each "probability score" is restricted to be a conditional probability of the right-hand side, given the parent node. So if you fix the parent node and add up the probability scores of each possible right-hand side for that parent node, the sum will be one. This constraint on the probability scores is computationally convenient, because it guarantees that the probabilities of all trees sum to one (the probabilities of the subtrees under each node sum to one, recursively), and it means we can get the probability of any particular structure by just multiplying the probability scores of the pieces that actually appear in that structure. However, it has the consequence that the model as a whole is parameterized in terms of the probability of the observations given the latent variables.
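To make the locally-normalized setup concrete, here is a toy sketch (the grammar, words, and numbers are all invented for illustration): each parent's rule probabilities sum to one, and a tree's probability is just the product of the probabilities of the rules it uses.

```python
# A tiny PCFG sketch; all rules and probabilities are made up.
# Each parent nonterminal maps to a list of (right-hand side, probability)
# pairs, and the probabilities for each parent sum to one.
PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("dogs",), 0.5), (("cats",), 0.5)],
    "VP": [(("bark",), 0.7), (("sleep",), 0.3)],
}

# Check the local normalization: each parent's rules sum to one.
for parent, rules in PCFG.items():
    assert abs(sum(p for _, p in rules) - 1.0) < 1e-9

def rule_prob(parent, rhs):
    """Look up the conditional probability P(rhs | parent)."""
    return dict(PCFG[parent])[rhs]

def tree_prob(tree):
    """Probability of a tree = product of its rule probabilities.
    A tree is (parent, [children]); a leaf child is just a string."""
    parent, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob(parent, rhs)
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

t = ("S", [("NP", ["dogs"]), ("VP", ["bark"])])
print(tree_prob(t))  # 1.0 * 0.5 * 0.7 = 0.35
```

Because of the local normalization, the four trees this grammar can generate have probabilities 0.35 + 0.15 + 0.35 + 0.15 = 1, with no extra normalization step needed.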

An alternative is to allow the "probability scores" to be any non-negative number, using what's called an *undirected* or *energy* model. There are two reasons that people don't do this in practice. First, if probability scores can be any non-negative number, and there are unobserved variables (we only see words, not dependency arcs, for example), the computational complexity gets really bad. If we multiply the probability scores of each piece in a structure, the resulting number is only *proportional* to the probability of the structure (the constraint in generative models just guarantees that the proportionality constant is always 1). To actually get the probability of the structure, we have to add up the products of the probability scores of all the pieces that we *could have* used, including the strings that we *could have* seen but did not. Doing this kind of sum exactly is impossible in general, since it ranges over the infinite set of possible strings (although random approximations are possible). Second, it turns out that undirected models have no more expressive power than directed models: it has been proven that, given any set of weights for an undirected model, it is possible to find a set of conditional probabilities for the corresponding directed model that produces exactly the same probability distribution over structures.
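Here is the undirected version of the same idea as a toy sketch (scores and rules invented): rule scores are arbitrary non-negative numbers, products of scores are only proportional to probabilities, and we have to compute the normalizer Z by summing over every tree the grammar allows. That enumeration is only feasible here because the toy grammar generates four trees; in general the sum ranges over all structures and all strings, which is exactly the computational nightmare described above.

```python
from itertools import product

# Same toy grammar shape as before, but each rule now carries an
# arbitrary non-negative score instead of a conditional probability.
SCORES = {
    ("S", ("NP", "VP")): 2.0,
    ("NP", ("dogs",)): 3.0,
    ("NP", ("cats",)): 1.0,
    ("VP", ("bark",)): 5.0,
    ("VP", ("sleep",)): 2.0,
}

def tree_score(rules):
    """Unnormalized score of a tree: the product of its rule scores."""
    s = 1.0
    for r in rules:
        s *= SCORES[r]
    return s

# Every tree in this toy grammar uses one S rule, one NP rule, and one
# VP rule, so we can enumerate all trees to compute the partition
# function Z. Real models cannot enumerate like this.
np_rules = [r for r in SCORES if r[0] == "NP"]
vp_rules = [r for r in SCORES if r[0] == "VP"]
trees = [(("S", ("NP", "VP")), np, vp) for np, vp in product(np_rules, vp_rules)]
Z = sum(tree_score(t) for t in trees)

def tree_prob(rules):
    """Actual probability: unnormalized score divided by Z."""
    return tree_score(rules) / Z

print(Z)  # 2*3*5 + 2*3*2 + 2*1*5 + 2*1*2 = 56
print(tree_prob((("S", ("NP", "VP")), ("NP", ("dogs",)), ("VP", ("bark",)))))  # 30/56
```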

So people pretty much never use undirected models for grammar induction: even though they are perfectly well-defined, they are a nightmare computationally, and don't buy you anything in principle. On the other hand, they have different inductive biases, so we might end up learning a different distribution over structures in practice. Sticking with directed models also leads to this weird situation where we are learning models that are parameterized in terms of the probability of the observations given the unobserved stuff, even though we usually don't care about the probability of the stuff we saw: if you "run" the models in this direction to generate new "observed stuff" like word strings, you typically get garbage. What we usually care about is the probability of the unobserved stuff given the observed stuff, which we get by using Bayes' rule to "run the models in reverse": i.e., we see a sentence and we want to know how to parse it; we don't care what the probability of that sentence is.
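"Running the model in reverse" can be sketched in a few lines (the parses and joint probabilities below are invented for illustration): Bayes' rule says P(tree | string) = P(tree, string) / P(string), where P(string) is just the sum of the joint over every parse of that string. Note how the string's own (possibly tiny, possibly badly-estimated) probability divides out.

```python
# Invented joint probabilities P(tree, string) for a toy ambiguous
# sentence: the same string has two candidate parses.
joint = {
    ("high-attach", "saw the man with the telescope"): 0.006,
    ("low-attach",  "saw the man with the telescope"): 0.002,
    ("only-parse",  "saw the man"): 0.012,
}

def posterior(string):
    """P(tree | string) = P(tree, string) / P(string), where P(string)
    sums the joint probability over every parse of that string."""
    joints = {t: p for (t, s), p in joint.items() if s == string}
    p_string = sum(joints.values())
    return {t: p / p_string for t, p in joints.items()}

print(posterior("saw the man with the telescope"))
# {'high-attach': 0.75, 'low-attach': 0.25}
```

The posterior over parses sums to one even though P(string) itself was only 0.008, which is the sense in which the string's probability "doesn't matter" for parsing.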

On a more theoretical level, though, I wonder how important this issue is. I mentioned in another thread that a notion of grammaticality based on typical sets might be interesting, but the feasibility of such a model relies on getting good probabilities when running the models "forward." Another consideration is theoretical tidiness: a single generative model could be responsible for both production (running "forwards") and perception (running "in reverse"). On the other hand, maybe it just isn't possible to partial out specifically linguistic influences (subject-verb agreement) from general world knowledge (movie-popcorn agreement) in string probabilities.

I don't necessarily have a specific question, but these are some things I've been thinking about lately and wondered if anybody had any reactions.

(PS feel free to move this to the computational or out-of-the-box section... I wasn't sure whether to classify based on topic or methodology...)