Specializations > Language Acquisition

Ruminations on using generative probabilistic models for language acquisition

(1/4) > >>

jkpate:
Statistical models of grammar induction, by which I mean models that estimate a probability distribution over both syntactic analyses and word strings after observing only word strings, are typically a type of model called a generative model. This notion of "generative" is actually pretty similar to the notion of "generative" in modern syntax: there is an inventory of re-usable substructures or pieces (like context free grammar rules, which are really just local subtrees), and structures and word strings are produced by selecting pieces from your inventory. The primary difference from standard syntax is that each choice is associated with a probability score.

Now, there's a little bit of a technical wrinkle here, and I was wondering if the other commenters here had any reactions. The thing that makes these models "generative" in the statistical sense is that each "probability score" is restricted to be a conditional probability of the right hand side, given the parent node. So if you fix the parent node, and add up the probability score of each possible right hand side of that parent node, the sum will be one. This constraint on the probability scores is computationally convenient, because it means we can guarantee that the probability of every tree sums to one, since the probability of every subtree sums to one in this recursive fashion, and we can get the probability of any particular structure by just multiplying the probability scores of the pieces that actually appear in that structure. However, this has the consequence that the model as a whole is parameterized in terms of observations given latent variables.

An alternative is to allow the "probability scores" to be any non-negative number, using what's called an undirected or energy model. There are two reasons that people don't do this in practice. First, if probability scores are any non-negative number, and there are unobserved variables (we only see words, not dependency arcs, for example), the computational complexity gets really bad. If we multiply the probability scores of each piece in a structure, the resulting number is only proportional to the probability of the structure (the constraint in generative models just guarantees that the proportionality constant is always 1). To actually get the probability of the structure, we have to add up the products of the probability scores of all the pieces that we could have used, including the strings that we could have seen but did not. Obviously, doing this kind of sum exactly is impossible (although random approximations are possible). Second, it turns out that undirected models do not have any more expressive power than directed models: it's been proven that given any set of weights from an undirected model, it is possible to find a set of conditional probabilities for the corresponding directed model that produces exactly the same probability distribution over structures.

So people pretty much never use undirected models for grammar induction: even though they are perfectly well-defined, they are a nightmare computationally, and don't buy you anything in principle. On the other hand, they have different biases, so we might end up learning a different distribution over structures in practice. Also, it leads to this weird situation where we are learning models that are parameterized in terms of the probability of the observations given the unobserved stuff:

even though we usually don't care about the probability of the stuff we saw: if you "run" the models in this direction to generate new "observed stuff" like word strings, you typically get garbage. What we usually care about is the probability of the unobserved stuff given the observed stuff. You can do this by using Baye's rule to "run the models in reverse:"

i.e., we see a sentence and we want to know how to parse it; we don't care what the probability of that sentence is.

On a more theoretical level, though, I wonder how important this issue is. I mentioned in another thread that a notion of grammaticality based on typical sets might be interesting, but the feasibility of such a model relies on getting good probabilities when running the models "forward." Another thing to consider is the potential for theoretical tidiness, with one generative model responsible for both production (running "forwards") and perception (running "in reverse"). On the other hand, maybe it just isn't possible to partial out specifically linguistic influences (subject-verb agreement) from general knowledge (movie-popcorn agreement) in string probabilities.

I don't necessarily have a specific question, but these are some things I've been thinking about lately and wondered if anybody had any reactions.

(PS feel free to move this to the computational or out-of-the-box section... I wasn't sure whether to classify based on topic or methodology...)

Daniel:
Interesting post. I don't have too much of a direct response, but I have several related ideas. Sorry this might drag the discussion to a tangent...

1. Computational complexity
This is a huge issue in engineering (speech recognition, machine translation, etc.). But it apparently isn't much of an issue in cognition-- we do this quite easily. It's hard to know exactly what the complexity of the brain is compared to a computer, but whatever the correct algorithm is, we should be able to implement it in linear time for example.
If the best algorithm found is one that is computationally impractical, it is probably the wrong algorithm. That's an odd puzzle because the problem does really seem to be very complicated.

2. What is the object of study?

--- Quote ---On the other hand, maybe it just isn't possible to partial out specifically linguistic influences (subject-verb agreement) from general knowledge (movie-popcorn agreement) in string probabilities.
--- End quote ---
This is where my current thoughts are about syntax, but beyond that point. You also said this:

--- Quote ---there is an inventory of re-usable substructures or pieces (like context free grammar rules, which are really just local subtrees), and structures and word strings are produced by selecting pieces from your inventory.
--- End quote ---
That's a standard position, and one that I question, actually.

Language has hierarchical structure (as much as it bothers me to admit that-- I've been trying to find some alternative for years and have finally accepted that it must be correct). But WHERE is that structure? I think this can lead us to be a better answer.

I'll copy two paragraphs from a short proposal I wrote for my dissertation project:

--- Quote from: my in-progress dissertation ---It is widely accepted in the field of Theoretical Linguistics that a major research goal is determining the relationship between form and meaning in natural languages. Since Chomsky’s early work (1956, 1957, 1959), the formal study of Syntax has assumed a model of individual nodes arranged in a hierarchical structure. Essentially each word is given a position within a hierarchy such that structural relationships can be determined between the nodes in a sentence. This approach has been extremely popular and productive for the field. It is the purpose of this dissertation to take another look at these basic assumptions in order to introduce a more accurate method for understanding the relationship between linguistic form (“Syntax”) and underlying meaning (“Semantics”). In short, the exact form of a linguistic utterance will be seen as the result of linearization and widely variable language-specific rules, while the underlying semantic structure will resemble what is assumed in the standard hierarchical models but without many of the complications introduced by trying to account for peculiarities of form.
While the standard theory often appears successful and leads to insightful results, I argue that this is an illusion: most of the time, linguistic form is isomorphic to semantic structure, but exceptions reveal that the situation is more complex. It is not unexpected that, ceteris paribus, linguistic form would reflect semantics, but this is not a strict rule in human language.
--- End quote ---

To name some specific examples:
1. Coordination
I'm now convinced that "and" is not actually a node in the tree, but rather the linear representation of a coordinated structure (whatever that actually is is irrelevant).

2. Control verbs as in "John tries [John] to swim."
There has been a huge amount of debate about how that extra "[John]" gets interpreted semantically. Maybe it's movement. Maybe there's some invisible pronoun. Etc. My answer is fairly simple: a preliminary linear parse adds "John" back in to the structure so that it is available for semantics.

3. Periphrasis:
Some language have multiple realizations for a single meaning. In a single paradigm, something that might be realized as an affix in one case might be realized as a separate word in another. There are some very convincing examples of this, but I can't list them off the top of my head. To take a familiar (and probably slightly less convincing example), consider English tense:
Present: sometimes -s
Past: -ed
Future: will
The future is oddly a separate word with very different syntax compared to the others.
My suggestion: this is just a surface-level issue, not relevant to the real hierarchical structure.

So what's left at the actual hierarchical structure?
Core relationships like modification and argument-selection. Maybe almost nothing else.

(There is a complication here in that obviously we need some kind of hierarchical structure to begin interpreting in the first place such as recognizing "try to" as a control structure rather than some other linear string. But I'll set that aside for now. It's probably some kind of boostrapping.)

In short: I think syntacticians are looking at the wrong thing.

So for your question, this suggests that building up a bunch of subtrees from surface form is the wrong approach and hence why it is computationally difficult and not especially reliable.

jkpate:

--- Quote from: djr33 on February 20, 2014, 01:26:14 PM ---Interesting post. I don't have too much of a direct response, but I have several related ideas. Sorry this might drag the discussion to a tangent...

1. Computational complexity
This is a huge issue in engineering (speech recognition, machine translation, etc.). But it apparently isn't much of an issue in cognition-- we do this quite easily. It's hard to know exactly what the complexity of the brain is compared to a computer, but whatever the correct algorithm is, we should be able to implement it in linear time for example.
If the best algorithm found is one that is computationally impractical, it is probably the wrong algorithm. That's an odd puzzle because the problem does really seem to be very complicated.

--- End quote ---

Yes, this is what I was trying to get at. Researchers have been avoiding undirected models because they are hard to engineer, but, since people probably use a resource-limited approximation algorithm anyway, this isn't really a justification for completely ignoring undirected models. There are still practical issues, however, in interpreting an implementation that relies on lots of approximations: is the behavior of our implementation due to the model, or to the kind of approximation our inference makes, or both?

--- Quote from: djr33 on February 20, 2014, 01:26:14 PM ---2. What is the object of study?

--- End quote ---

For me, at this point, the object of study is the data available to children. What kinds of assumptions about the shape of linguistic structures turns that data into good evidence for structures? The algorithm the child uses is in principle separate from the assumptions about possible linguistic structures the algorithm makes; or, to put it another way, there are many different algorithms that rely on the same set of assumptions (or biases, or prior expectations) about what linguistic structures look like, and I'd like to tackle the issue of learning biases without having to get the details of the learning procedure right. The clearest way to do this is to use a really accurate inference procedure, so we can be confident that the conclusions about linguistic structure that our implementation reaches after seeing data are due to the model assumptions.

--- Quote from: djr33 on February 20, 2014, 01:26:14 PM ---So what's left at the actual hierarchical structure?
Core relationships like modification and argument-selection. Maybe almost nothing else.

--- End quote ---

This sounds reasonable to me. In practice, the grammar induction systems that people have gotten to work have relied on dependency grammar or CCG representations that focus on modification and argument-selection. Grammar induction systems that model the details of linear order have not worked.

--- Quote from: djr33 on February 20, 2014, 01:26:14 PM ---(There is a complication here in that obviously we need some kind of hierarchical structure to begin interpreting in the first place such as recognizing "try to" as a control structure rather than some other linear string. But I'll set that aside for now. It's probably some kind of boostrapping.)

In short: I think syntacticians are looking at the wrong thing.

So for your question, this suggests that building up a bunch of subtrees from surface form is the wrong approach and hence why it is computationally difficult and not especially reliable.

--- End quote ---

I'm a bit confused by this suggestion. Building up a bunch of subtrees from surface forms is not that hard and can be done relatively efficiently. What is hard is estimating probabilities for those subtrees if we don't constrain the probability score of each piece to be conditional probabilities. Or maybe I haven't understood your suggestion?

Daniel:

--- Quote ---Yes, this is what I was trying to get at. Researchers have been avoiding undirected models because they are hard to engineer, but, since people probably use a resource-limited approximation algorithm anyway, this isn't really a justification for completely ignoring undirected models. There are still practical issues, however, in interpreting an implementation that relies on lots of approximations: is the behavior of our implementation due to the model, or to the kind of approximation our inference makes, or both?
--- End quote ---
The word "approximation" bothers me. Given the fact that children do learn language, they probably end up adapting it to the most efficient form for processing in some sense (broadly), so it probably isn't something that needs to be approximated-- it probably is relatively optimized, as a system.

Of course in terms of surface form, they are approximating something, such as distribution probabilities.

So I think we agree on this.

--- Quote ---For me, at this point, the object of study is the data available to children. What kinds of assumptions about the shape of linguistic structures turns that data into good evidence for structures?
--- End quote ---
I agree, but the details are crucial: how is that input interpreted? What is it mapped onto?
I would tentatively say something like: they are identifying relationships in the sentences (dependencies) and mapping them probabilistically to forms.
I wouldn't say that they are analyzing surface forms directly.

--- Quote ---This sounds reasonable to me. In practice, the grammar induction systems that people have gotten to work have relied on dependency grammar or CCG representations that focus on modification and argument-selection. Grammar induction systems that model the details of linear order have not worked.
--- End quote ---
That sounds right to me. I'm not suggesting a purely linear model. Instead, I'm suggesting one that is designed to intelligently/complexly map a linear form onto a hierarchical structure. The indirectness is what will help.

--- Quote ---I'm a bit confused by this suggestion. Building up a bunch of subtrees from surface forms is not that hard and can be done relatively efficiently. What is hard is estimating probabilities for those subtrees if we don't constrain the probability score of each piece to be conditional probabilities. Or maybe I haven't understood your suggestion?
--- End quote ---
I'm saying that the subtrees are the wrong objects to be working with, that they don't exist in an algorithm of language. Instead, the assembly should be done at the level of something like conceptual structure (dependencies, whatever) then that should be mapping onto a linear form.

Let me explain with an example of coordination. I think the actual structure is something like this:
"a and b"
I'm not certain about the tree, but I am fairly confident that the word "and" is not a node in it, not at the relevant level. Then as a coordinated structure, "and" gets inserted during linearization.

(Don't take my ideas too seriously yet-- I'm working out a lot of details/complications.)

So in short, what I'm suggesting is that attempting to map 1-to-1 from words to nodes in a subtree is a bad plan. The must better approach is to work indirectly, sort of like in machine translation going from source language to semantic metalanguage to output language.

Anyway, that's what I think.

jkpate:

--- Quote from: djr33 on February 20, 2014, 08:47:17 PM ---
--- Quote ---Yes, this is what I was trying to get at. Researchers have been avoiding undirected models because they are hard to engineer, but, since people probably use a resource-limited approximation algorithm anyway, this isn't really a justification for completely ignoring undirected models. There are still practical issues, however, in interpreting an implementation that relies on lots of approximations: is the behavior of our implementation due to the model, or to the kind of approximation our inference makes, or both?
--- End quote ---
The word "approximation" bothers me. Given the fact that children do learn language, they probably end up adapting it to the most efficient form for processing in some sense (broadly), so it probably isn't something that needs to be approximated-- it probably is relatively optimized, as a system.

Of course in terms of surface form, they are approximating something, such as distribution probabilities.

So I think we agree on this.

--- Quote ---For me, at this point, the object of study is the data available to children. What kinds of assumptions about the shape of linguistic structures turns that data into good evidence for structures?
--- End quote ---
I agree, but the details are crucial: how is that input interpreted? What is it mapped onto?
I would tentatively say something like: they are identifying relationships in the sentences (dependencies) and mapping them probabilistically to forms.
I wouldn't say that they are analyzing surface forms directly.

--- End quote ---

Hmm, maybe my confusion has to do with the term "surface form" and "distribution probabilities" that are in "terms of surface form." What do you consider to surface form? Sometimes people use "surface form" to mean the string that is generated, and sometimes they use "surface form" to mean the whole structure after a series of operations, like transformations, have been applied. These models define probabilities over structures, they're not just relying on n-gram statistics from the string.