# Linguist Forum

## General Linguistics => Linguist's Lounge => Topic started by: freknu on June 17, 2014, 03:52:37 AM

Title: Bengtson & Ruhlen, Global Etymology
Post by: freknu on June 17, 2014, 03:52:37 AM
Should I go schedule some CAT-scans and prepare for a drawn-out fight against brain cancer ... or is this paper the cause of my headaches? :/

(The "etymologies" begin at page 16.)

(EDIT)

Criticism by Lyle Campbell:
http://www2.hawaii.edu/~lylecamp/Campbell%20OriginsProofs.pdf
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 17, 2014, 04:55:05 AM
It's well out of fashion to like Greenberg, but I'll admit that I do, secretly, quite a bit. Ruhlen (his most prominent student), however, has jumped the proverbial shark.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: freknu on June 17, 2014, 05:18:03 AM
I think I read about, or maybe saw a documentary on, his efforts to classify the African languages. Exactly how well do those (the mass comparisons) stand up to closer scrutiny?

Just four language families/branches for all of Africa? Considering the vast sea of branches just in Eurasia it does seem a little ... lacking.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 17, 2014, 05:43:26 AM
They don't stand up to scrutiny at all. His work is well outside of mainstream acceptance, and his methodology is considered dubious at best.

Virtually all serious work in historical linguistics takes it as axiomatic that there's a hard horizon beyond which we won't be able to reconstruct proto-languages reliably. How far back that goes depends on a number of factors, of course, but it's hard to imagine getting much farther back than 6-9k years ago even in the best of circumstances. Ruhlen thinks he's constructing forms nearly an order of magnitude older than that.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: freknu on June 17, 2014, 06:09:02 AM
Quote from: MalFet
They don't stand up to scrutiny at all. His work is well outside of mainstream acceptance, and his methodology is considered dubious at best.

So is there any broader classification of African languages at all? Is the stuff I read on Wikipedia reliable? Or are African languages still a vast unexplored wilderness?

Quote from: MalFet
Virtually all serious work in historical linguistics takes it as axiomatic that there's a hard horizon beyond which we won't be able to reconstruct proto-languages reliably. How far back that goes depends on a number of factors, of course, but it's hard to imagine getting much farther back than 6-9k years ago even in the best of circumstances. Ruhlen thinks he's constructing forms nearly an order of magnitude older than that.

As for his work in historical linguistics, I looked around a bit, and I think it was a critique by Lyle Campbell (no clue about his merits) that gave a concise and perceptive statement: judging by the level of change between the hypothetical PIE and its descendants, beyond 5,000-10,000 years you statistically cannot separate any possible valid relations from the noise, making comparative linguistics impossible past a certain point (the event horizon, if you want to be fancy) ;)
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 17, 2014, 10:44:13 AM
Some of that stuff is pretty silly, though it is interesting to hypothetically explore.

As for Africa, his classification is more or less accepted broadly. That doesn't mean there are not exceptions, but at least in general the major four subgroupings (plus Austronesian in Madagascar) are the standard starting point. This is not true, for example, for his classification in the Americas which appears to be wrong, though I still think it's an interesting idea to line up the linguistic distribution with migrations.

Overall there's a question of what it would mean to be "right" or "wrong" in these cases-- I doubt Greenberg or Ruhlen actually thinks they're strictly speaking correct about any of the etymologies, but rather that they're narrowing down the possibilities and making good guesses. Taking it in that context, the work is less crazy, but there still are some fairly obvious limitations that they seem to be ignoring.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 17, 2014, 01:44:20 PM
Quote from: MalFet
They don't stand up to scrutiny at all. His work is well outside of mainstream acceptance, and his methodology is considered dubious at best.

Quote from: freknu
So is there any broader classification of African languages at all? Is the stuff I read on Wikipedia reliable? Or are African languages still a vast unexplored wilderness?

The bulk of African languages are reasonably well classified into three or four families. The weakest evidence is probably what groups Nilo-Saharan, but even that's leagues ahead of these Proto-Sapiens reconstructions.

Like you suggest, some "isolates" are only isolates because we haven't yet done the reconstruction work (e.g., a lot of the Amazon), but then there are a great many that will probably never be reliably linked because the time depth leaves us drowning in noise.

Quote from: MalFet
Virtually all serious work in historical linguistics takes it as axiomatic that there's a hard horizon beyond which we won't be able to reconstruct proto-languages reliably. How far back that goes depends on a number of factors, of course, but it's hard to imagine getting much farther back than 6-9k years ago even in the best of circumstances. Ruhlen thinks he's constructing forms nearly an order of magnitude older than that.

Quote from: freknu
As for his work in historical linguistics, I looked around a bit, and I think it was a critique by Lyle Campbell (no clue about his merits) that gave a concise and perceptive statement: judging by the level of change between the hypothetical PIE and its descendants, beyond 5,000-10,000 years you statistically cannot separate any possible valid relations from the noise, making comparative linguistics impossible past a certain point (the event horizon, if you want to be fancy) ;)

Bingo! :)
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 17, 2014, 02:57:10 PM
I'm bothered by the idea of a strict limit. Instead, I think it's exponentially more difficult earlier in time. So maybe 12,000 works sometimes. If really pushed, maybe 15,000. But 100,000 is almost certainly ridiculous.
[Edit: I also think methods for looking earlier should not be eyeballing dictionaries. They should involve systematic reconstruction and use of the earliest records then comparing those to each other. It will never go anywhere to compare what languages look like today to what they might have looked like a long time ago, without serious/rigorous backtracking. It is still likely to fail at a certain time depth, but if anything that's the right method.]

Also, I meant to add above that the Khoisan group has been questioned recently and I have a few references to that effect if you're interested:

Güldemann, Tom & Edward D Elderkin. 2010. On external genealogical relationships of the Khoe family. In Matthias Brenzinger & Christa König (eds.), Khoisan Languages and Linguistics: Proceedings of the 1st International Symposium January 4-8, 2003, 15–52. Köln: Rüdiger Köppe Verlag.
Heine, Bernd & Henry Honken. 2010. The Kx’a Family. A New Khoisan Genealogy. Journal of Asian and African Studies 79. 5–36.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: freknu on June 19, 2014, 12:02:20 PM
Something that I recently came across which I quite liked.

Quote
A remote relationship of Indo-European to the Uralic languages is possible. Geographically, the earliest reconstructing locations of the two families are contiguous.

On the whole, however, the lexical resemblances between Indo-European and Uralic are very sparse; the two families, if they are related at all, must have separated thousands of years before the breakup of Proto-Indo-European.

If Indo-European is related to other language-families—e.g., to Afro-Asiatic (which includes the Semitic languages) or to Kartvelian (which includes Georgian)—it must have diverged from them much earlier than it diverged from Uralic, because the number of cogent resemblances is still smaller.

http://www.protogermanic.com/2013/08/indo-european-languages.html

It's a bit of an oversimplification, but what we are looking at is essentially $t^n$ ... the difficulty isn't exactly linear, and trying to reconstruct farther back using proto-languages makes it even tougher, if one can even consider it valid practice to begin with.

(EDIT)

As djr mentioned, exponential — but at some point you still reach a situation where the SNR !>1, and you can now stare yourself blind at the event horizon without making any further progress.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 19, 2014, 03:21:07 PM
It's all statistical. As the time depth increases, so does the margin of error. So while the available data may suggest a closer relationship with Uralic than other groups, the margin of error also increases with that time depth, so in statistical terms there may be no reason at all to assume any relevance there. Likewise, there is no true "event horizon" because we can always make some kind of guess-- but with ever increasing margins of error, we end up at a point where there is almost no statistically relevant information. In other words, we can continue to make hypotheses but with no way to test them. In that sense, I guess there is an "event horizon" in that the tests don't even hint one way or the other in a statistical sense.
Remember even with Indo-European the margin of error is far greater than 0. So there is no point before which we can know things for certain and point after which we can't, but rather just a reasonable time depth where our guesses don't seem too crazy.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: freknu on June 19, 2014, 04:41:26 PM
That's what I meant by "event horizon" and "SNR not greater than 1" ... any valid relation is drowned out in noise, so even if you could make hypotheses and guesstimations, it's not falsifiable; even if SNR was far less than 1 there could still be valid data in there, there's just no way of getting it out.

This may or may not correspond to any particular data set or time scale, but the statistical event horizon remains.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 19, 2014, 04:59:41 PM
One problem is that the SNR is not known. So the probable SNR decreases with time depth, but we have no way to know exactly how. Overall, yes, what you said. But we don't ever know when it's too far.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 20, 2014, 05:58:39 AM
Quote from: Daniel
One problem is that the SNR is not known. So the probable SNR decreases with time depth, but we have no way to know exactly how. Overall, yes, what you said. But we don't ever know when it's too far.

Not true. Definitionally, you never know whether a particular piece of data is signal or noise. That's why it's noise. But, any serious examination of language relatedness (though, natch, not Ruhlen's) will quantify the degree of entropy in the system, and thus we absolutely can know when we've gone back too far.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 20, 2014, 08:54:33 AM
How?

Icelandic hasn't changed much in the past 1000 years. English has changed immensely. The amount of noise entirely depends on the specific language(s) in question.

A normal problem involving noise, as you said, does not allow us to separate the real data from the noise, so we consider the SNR. But in this case, the SNR itself is unknown because the amount of noise is dependent on a number of unknown (unknowable?) factors.

Quote
But, any serious examination of language relatedness (though, natch, not Ruhlen's) will quantify the degree of entropy in the system, and thus we absolutely can know when we've gone back too far.
Can you explain this a bit? In historical linguistics everything is a guess (some better than others, of course). So how can we absolutely know anything?
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 21, 2014, 05:59:31 AM
Quote from: Daniel
How?

Icelandic hasn't changed much in the past 1000 years. English has changed immensely. The amount of noise entirely depends on the specific language(s) in question.

A normal problem involving noise, as you said, does not allow us to separate the real data from the noise, so we consider the SNR. But in this case, the SNR itself is unknown because the amount of noise is dependent on a number of unknown (unknowable?) factors.

I'm not really sure what you're understanding as "noise" here. In a standard cross-linguistic corpus analysis, the challenge is simply to separate similarities due to relatedness from similarities due to chance, as conditioned by phonological patterns. There are many ways to go about this, but most techniques involve predicting a baseline of similarity expected between unrelated languages and then measuring forward against that standard. With this information, can you know the relatedness of two particular lexical items? Usually not very well. Can you know the relatedness of two large lexical sets? Yes, to a high degree of probability.
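The baseline idea described above can be sketched with a permutation test, which is one of several possible approaches and not necessarily the one any particular study uses. Everything here is an assumption made for illustration: the twelve-item English/Latin word list, the use of the first letter as a crude proxy for the initial sound, the sum-of-squares statistic, and the hypothetical function names `recurrence_score` and `permutation_p_value`. Shuffling one list scrambles the meaning alignment, so the shuffled scores estimate what chance alone produces.

```python
import random
from collections import Counter

# Toy data: English/Latin pairs aligned by meaning. The first letter is used
# as a crude stand-in for the initial sound (an assumption of this sketch).
ENGLISH = ["father", "foot", "fish", "full", "heart", "horn",
           "hound", "hundred", "two", "ten", "tooth", "tame"]
LATIN = ["pater", "pes", "piscis", "plenus", "cor", "cornu",
         "canis", "centum", "duo", "decem", "dens", "domare"]

def recurrence_score(list_a, list_b):
    """Sum of squared counts of each initial-letter pairing.

    Recurrent pairings (f:p, h:c, t:d) inflate the score; scattered ones do not.
    """
    pairings = Counter((a[0], b[0]) for a, b in zip(list_a, list_b))
    return sum(n * n for n in pairings.values())

def permutation_p_value(list_a, list_b, trials=2000, seed=0):
    """Estimate how often meaning-scrambled lists score at least as high."""
    rng = random.Random(seed)
    observed = recurrence_score(list_a, list_b)
    shuffled = list(list_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break the meaning alignment: the chance baseline
        if recurrence_score(list_a, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (trials + 1)

observed, p_value = permutation_p_value(ENGLISH, LATIN)
print(f"observed score = {observed}, estimated p = {p_value:.4f}")
```

Note that the statistic rewards recurring correspondences rather than look-alikes, so a pair like f:p counts as evidence even though the sounds differ. A single pair on its own contributes almost nothing, which is the point about individual items versus large lexical sets.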

Quote
But, any serious examination of language relatedness (though, natch, not Ruhlen's) will quantify the degree of entropy in the system, and thus we absolutely can know when we've gone back too far.
Can you explain this a bit? In historical linguistics everything is a guess (some better than others, of course). So how can we absolutely know anything?

If your standard is absolute knowledge (whatever that means!), you'll have to find a new field. In science, everything we know is subject to revision in the face of newer and better evidence. That caveat comes pre-baked into how scientific epistemologies use the word "know", and there's nothing particularly unusual about historical linguistics in this regard.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 21, 2014, 10:53:31 AM
Quote
I'm not really sure what you're understanding as "noise" here. In a standard cross-linguistic corpus analysis, the challenge is simply to separate similarities due to relatedness from similarities due to chance, as conditioned by phonological patterns. There are many ways to go about this, but most techniques involve predicting a baseline of similarity expected between unrelated languages and then measuring forward against that standard. With this information, can you know the relatedness of two particular lexical items? Usually not very well. Can you know the relatedness of two large lexical sets? Yes, to a high degree of probability.
All I'm saying is that we never really can know. Finding a suitable baseline is challenging because the comparison of any two languages is not identical to the comparison of others. Certainly we can find a reasonable range and realize that it's hard to compare/reconstruct beyond, say, 15,000 years. But we can't rule out the possibility that there are tiny bits of relevant evidence still there. As I said, it's exponentially less precise, but that doesn't mean there's precisely no information or that there is some "wall" we can't go past. It just means that there's less and less reason to try the farther back we go. That's all I've been saying. This "wall" is a myth, just as much as the mass comparison method. Statistically speaking, it's more correct, but only because it approximates the real situation, not because it really exists.
There's no reason at all we couldn't find some case where we can compare languages 25,000 years old. The odds are against it, and we may never know if that's due to chance or not, but there probably are some cases out there. Our time is, however, likely better spent on other projects.

Quote
If your standard is absolute knowledge (whatever that means!), you'll have to find a new field. In science, everything we know is subject to revision in the face of newer and better evidence. That caveat comes pre-baked into how scientific epistemologies use the word "know", and there's nothing particularly unusual about historical linguistics in this regard.
I completely agree. So why did you write "we absolutely can know"? I don't think we can. I think it's always possible that an ancient comparison is correct, but just incredibly unlikely. The only way to know if we're going back "too far" is based on how reliable we want to be. Taken that way, older comparisons aren't so bad, as long as they aren't presented as fact or even reliable guesses.

Further, this goes back to what I said earlier: if we compare reconstructions to reconstructions, that should (probabilistically speaking) reduce some of the noise. Borrowing/contact, for example, would be partially eliminated if we use a well reconstructed version of PIE and compare it to a well reconstructed version of Proto-Uralic. This should boost how far back we can go. I suppose in this case the limit would be simply based on available information. Languages aren't infinite in a reconstruction sense-- we rely on data points like lexical items and perhaps syntactic constructions. So we reconstruct partial languages ("core vocabulary") then would have less to work with and have even less as a result of a secondary reconstruction. (And of course overall it would be less reliable, not to say entirely irrelevant.) I'd be interested in seeing more of this and less number crunching based on surface data :)
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: freknu on June 21, 2014, 05:21:17 PM
Quote
If your standard is absolute knowledge (whatever that means!), you'll have to find a new field. In science, everything we know is subject to revision in the face of newer and better evidence. That caveat comes pre-baked into how scientific epistemologies use the word "know", and there's nothing particularly unusual about historical linguistics in this regard.
I completely agree. So why did you write "we absolutely can know"?

I doubt he meant "it is possible to absolutely know" but rather "it is absolutely possible to know".
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: jkpate on June 22, 2014, 12:44:38 AM
Quote from: Daniel
One problem is that the SNR is not known. So the probable SNR decreases with time depth, but we have no way to know exactly how. Overall, yes, what you said. But we don't ever know when it's too far.

Quote from: MalFet
Not true. Definitionally, you never know whether a particular piece of data is signal or noise. That's why it's noise. But, any serious examination of language relatedness (though, natch, not Ruhlen's) will quantify the degree of entropy in the system, and thus we absolutely can know when we've gone back too far.

Hmm, when we talk about quantifying the degree of entropy in the system, I think about explicit probabilistic models that allow us to compute various entropies and get numbers in bits. While there is a bit of work in this direction (http://www.pnas.org/content/early/2013/02/05/1204678110.full.pdf), it's far from standard in historical linguistics as I understand it. Is this the kind of work that you are referring to when you talk about quantifying the degree of entropy in the system?
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 22, 2014, 03:59:28 AM
Quote from: freknu
I doubt he meant "it is possible to absolutely know" but rather "it is absolutely possible to know".
Ah, probably. But still, how is it then certain that we can know, when we don't know much of anything in historical linguistics?

jkpate, agreed :)

In the end, we're all saying basically the same thing, though in different ways: it's crazy to try to reconstruct after, say, 20,000 years, and probably a bad idea even after 10,000. But, no, I still don't see any particular wall.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: jkpate on June 22, 2014, 05:10:28 AM
I don't agree that we don't know much of anything. I'd say we know things with a greater degree of uncertainty than in other fields, even if we usually don't quantify that degree of uncertainty.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 22, 2014, 06:17:27 AM
Quote from: Daniel
One problem is that the SNR is not known. So the probable SNR decreases with time depth, but we have no way to know exactly how. Overall, yes, what you said. But we don't ever know when it's too far.

Quote from: MalFet
Not true. Definitionally, you never know whether a particular piece of data is signal or noise. That's why it's noise. But, any serious examination of language relatedness (though, natch, not Ruhlen's) will quantify the degree of entropy in the system, and thus we absolutely can know when we've gone back too far.

Quote from: jkpate
Hmm, when we talk about quantifying the degree of entropy in the system, I think about explicit probabilistic models that allow us to compute various entropies and get numbers in bits. While there is a bit of work in this direction (http://www.pnas.org/content/early/2013/02/05/1204678110.full.pdf), it's far from standard in historical linguistics as I understand it. Is this the kind of work that you are referring to when you talk about quantifying the degree of entropy in the system?

Definitely. Though it's not quite the standard of practice I'd like it to be, work along these lines isn't all that rare either. Most work in panchronic phonology (which, I think it's now fair to say, has won the war for hearts and minds among phonologists, even if the rest of linguistics hasn't yet jumped on board) moves in these directions. I'm thinking in particular of work by Juliette Blevins and George van Driem.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 22, 2014, 06:20:16 AM
Quote from: Daniel
In the end, we're all saying basically the same thing, though in different ways: it's crazy to try to reconstruct after, say, 20,000 years, and probably a bad idea even after 10,000. But, no, I still don't see any particular wall.

Can you find me a linguist who advocates for this "wall" you speak of? You're arguing against a strawman.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 22, 2014, 01:28:25 PM
You, in this thread. And others I've heard quoted against the Ruhlen et al position.

Quote from: MalFet
Virtually all serious work in historical linguistics takes it as axiomatic that there's a hard horizon beyond which we won't be able to reconstruct proto-languages reliably.
Quote from: freknu
you statistically cannot separate any possible valid relations from the noise — making comparative linguistics impossible past a certain point, the event horizon
Quote from: freknu
As djr mentioned, exponential — but at some point you still reach a situation where the SNR !>1, and you can now stare yourself blind at the event horizon without making any further progress.
Quote from: MalFet
But, any serious examination of language relatedness (though, natch, not Ruhlen's) will quantify the degree of entropy in the system, and thus we absolutely can know when we've gone back too far.

In some sense, there may be such a barrier, but we cannot know precisely where it is (5,000 years? 10,000 years? 20,000 years? more?), and this all depends on the available data, so that, actually in theory, we could track the languages back farther in time given more data. So there is no barrier per se, except as a limit to what we have available to us (data), and we can never actually know where that barrier is. We can make some pretty reasonable guesses about what to do and what not to do, but it's an unscientific position to point to a barrier and then attempt to give it some kind of cutoff date. As I've been saying, it's just less and less likely to be productive. There's a certain point where, like playing the lottery against great odds, most linguists will give up. Anyone who keeps playing must be aware of the odds against them. So if we take Ruhlen's work in this sense (whether or not he does himself) then it's a lot more reasonable: it's the best prediction of what might have been the case many millennia ago with the caveat that it's almost certainly wrong.

The real problem with this position is that we know almost nothing about PIE with any real certainty. I do think something (which we now call PIE) existed, somewhere, at some time. I think most linguists would agree to that -- the IE languages are related. Beyond that, basics like reconstructed words or the homeland are constantly debated. And it's highly arbitrary to take PIE as the poster case for reconstruction, when it also happens to be the most researched.

So we know little about PIE with certainty, we know less about earlier relationships with less certainty as well. But there's no particular cutoff point, at least not one that is accessible to us.

All of the data is constantly cycling, so that by the time of PIE much of it isn't of any use for reconstruction for us. So there's a barrier of sorts for PIE too-- at least for many lexical items, etc. Possibly even the homeland. But we don't know exactly what. And likewise, going back further, there are probably _some_ things we can do with reasonable accuracy, even if we cannot know for certain whether we're doing them accurately.

My entire point, and I do think this is important, is that we should not talk about barriers but about probabilities. At 100,000 years the probability of the best hypothesis being correct may be (let's say arbitrarily) 1%. And that's that. It's no better or worse than 1%. It is what it is. At the time of PIE, maybe it's something like 60% (again, arbitrarily picking a number).

So from a practical perspective, what is the most useful way to spend our time? Probably at a shallower time depth.

But I also think these details are important for one specific reason: rather than actually being critical of the methodology used by Ruhlen et al, the most common argument is "nah, can't do that!". The bigger problem is the way they're trying to reconstruct ancient relationships. Just comparing surface data is a terrible idea. And if there is any way to go back further I'm certain it's not just with surface data. It needs to be incremental with intermediate reconstructions (along with all of the uncertainty added in that process). While still far from certain, that's going to be more likely to lead us to relevant conclusions about earlier families.

So, my suggestions:

1. Each hypothesis should be judged based on (heuristically) how well it could possibly be determined, not "whether" it can be determined.
2. For each hypothesis space (eg, what's the ancestor of PIE), we should (time permitting and risk/reward deemed worthwhile) find the best hypothesis and consider it along with (1).
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: freknu on June 22, 2014, 08:47:57 PM
I think you're taking it a bit too literally. Let's see if I can't pull off a somewhat decent practical analogy ...

Event horizon
Loosely put, and to continue with the black hole analogy, (the distance to) the statistical event horizon is (directly) related to the mass (of information):

$\varepsilon \propto m$

The more information you have the farther away the event horizon moves — it is not a static and immovable concept.

Visible horizon
Likewise, and to continue with the horizon analogy, (the distance to) the statistical visible horizon is (directly) related to the height (of understanding):

$\eta \propto h$

The more understanding you have the farther away the visible horizon moves — it is not a static and immovable concept.

Thus it is not a simple function of time, $f(t)$, but rather, $f(\varepsilon, \eta) \propto (m, h)$, which may or may not directly correspond to any explicit scale or depth of time. You are focusing too greatly on an explicit time depth, when neither MalFet nor I have even hinted at any such thing. Scale or depth of time may very well be nestled deep somewhere in the equation, but it is not a simple one-to-one correspondence, nor have MalFet or I hinted at such a simple relation.

E.g. tell me, using NOTHING but the words "four" (English) and "quattuor" (Latin), can you show me that they are related?
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 22, 2014, 11:44:31 PM
Quote
which may or may not directly correspond to any explicit scale or depth of time
Exactly. And this means that Ruhlen's work might be correct. It's just very unlikely.

Overall, I believe we agree.

But I do object to the phrasing that we can object to Ruhlen's work because it's "too early" or anything like that. More reasonably, it's almost certainly too early. I realize that sounds like a minor objection, but I think it's important to effectively show the problems by not stating the objections hyperbolically.

For the record, here's a good documentary (a bit dated, but still relevant, and with all the big names in it) on the subject:

As for an explicit mention of a "limit", see this part:
http://youtu.be/J0phq7litTc?t=31m5s

Overall, Ringe is more correct than Ruhlen. But I still object to Ringe's phrasing.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: freknu on June 23, 2014, 12:54:20 AM
Quote
which may or may not directly correspond to any explicit scale or depth of time
Exactly. And this means that Ruhlen's work might be correct. It's just very unlikely.

Anything might be correct, anything might be incorrect — however, what matters is what you can demonstrate; hence the event horizon.

The difference is that while the cosmic event horizon is static and immovable due to the properties of light, the statistical (or perhaps epistemological) event horizon is dynamic and movable due to the properties of knowledge.

I would dare say that at this point in time we do not have the knowledge necessary to move the event horizon back far enough to be able to utilise comparative language to such a degree — that might change in the future, but that doesn't remove the event horizon.

Therefore, what today brings us beyond the event horizon and into the indistinguishable sea of noise, might not be the case tomorrow; but tomorrow comes tomorrow.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 23, 2014, 02:46:50 AM
So there may be an event horizon, but we don't know where it is, so there's no point in operating based on that. Rather, we operate based on the probability that what we do (in research) may be useful or insightful. So shallower time depths are better. As a heuristic, 5,000 years is ok, while 100,000 is not. But that's not because we've identified a limit at 10,000-12,000 years as Ringe claims for example :)
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: freknu on June 23, 2014, 03:10:25 AM
Quote from: Daniel
So there may be an event horizon, but we don't know where it is, so there's no point in operating based on that. Rather, we operate based on the probability that what we do (in research) may be useful or insightful. So shallower time depths are better. As a heuristic, 5,000 years is ok, while 100,000 is not. But that's not because we've identified a limit at 10,000-12,000 years as Ringe claims for example :)

I would say it's similar to our thermoception: the closer we get the stronger the sensation/awareness; but there is no solid barrier to touch so it feels like a continuously growing gradient, burning ever hotter.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 23, 2014, 07:38:24 AM
Quote from: Daniel
You, in this thread. And others I've heard quoted against the Ruhlen et al position.

Quote from: MalFet
Virtually all serious work in historical linguistics takes it as axiomatic that there's a hard horizon beyond which we won't be able to reconstruct proto-languages reliably.
Quote from: freknu
you statistically cannot separate any possible valid relations from the noise — making comparative linguistics impossible past a certain point, the event horizon
Quote from: freknu
As djr mentioned, exponential — but at some point you still reach a situation where the SNR !>1, and you can now stare yourself blind at the event horizon without making any further progress.
Quote from: MalFet
But, any serious examination of language relatedness (though, natch, not Ruhlen's) will quantify the degree of entropy in the system, and thus we absolutely can know when we've gone back too far.

In some sense, there may be such a barrier, but we cannot know precisely where it is (5,000 years? 10,000 years? 20,000 years? more?), and this all depends on the available data, so that, actually in theory, we could track the languages back farther in time given more data. So there is no barrier per se, except as a limit to what we have available to us (data), and we can never actually know where that barrier is. We can make some pretty reasonable guesses about what to do and what not to do, but it's an unscientific position to point to a barrier and then attempt to give it some kind of cutoff date.

People do not point to a barrier and then attempt to give it some kind of cutoff date. I have not done that, freknu has not done that, and nobody arguing against Ruhlen has done that. That's just not what's happening here. If you're interested in this stuff, I'd encourage you to do some of the actual number-crunching yourself sometime. As it stands, however, you seem to be missing some fundamentals and as a consequence are misunderstanding what I (and everyone else) are actually saying.

As with many statistical phenomena, the certainty of historical reconstructions does not scale with the quality of available data in a linear way. This is a very important fact. If your data is 50% as good as some baseline, your reconstructions will be substantially *less* than 50% as reliable as the baseline's. When you start building probabilistic reconstructions from probabilistic reconstructions from probabilistic reconstructions, you approach randomness not by steps but by leaps.
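
To make the non-linear scaling concrete, here is a minimal sketch of how reliability compounds along a chain of reconstructions. It assumes each step holds independently with the same probability, which is a simplification for illustration only; the function name and the 0.9 / 0.45 figures are hypothetical and not taken from any real dataset.

```python
def chained_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in a chain holds, assuming independent steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    # Hypothetical baseline (0.9 per step) versus data "50% as good" (0.45 per step).
    for per_step in (0.9, 0.45):
        row = ", ".join(f"{chained_reliability(per_step, n):.3f}" for n in range(1, 6))
        print(f"per-step {per_step}: {row}")
```

Under these assumptions the half-as-good data is only one eighth as reliable as the baseline after three chained steps, since the ratio between the two shrinks by a factor of 0.5 at every step.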

In other words, proper stochastic reconstructions *begin* with the understanding that a certain extent of apparent concordance will be present in every comparison by mere random chance, and there are tried-and-true ways of quantifying this extent of chance. Once you get past a certain point, the system contains enough mutation to push the similarities into a band of probability that makes it fundamentally impossible to separate from chance. This is not pointing at an arbitrary barrier, as you keep insisting. It is a basic, empirical observation emerging from the character of the data itself. That's just how the math works. At a certain point, your measured reliability just drops off a cliff. This is not strictly a function of time, but in many of the world's large language families that precipice tends to sit (because of the data!) right around 6-10k y.a.
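
One simple way to put a number on "apparent concordance by mere random chance" is a binomial tail. The sketch below assumes that each compared word pair matches by accident with a fixed, independent probability; that probability, the list size, and the function name are hypothetical choices for illustration, not values from the literature.

```python
from math import comb


def chance_match_tail(n_pairs: int, p_chance: float, observed: int) -> float:
    """P(at least `observed` matches among n_pairs) when each pair
    matches purely by chance with probability p_chance."""
    return sum(
        comb(n_pairs, k) * p_chance**k * (1 - p_chance) ** (n_pairs - k)
        for k in range(observed, n_pairs + 1)
    )


if __name__ == "__main__":
    # Hypothetical setup: 100 meanings compared, 5% accidental match rate.
    for observed in (5, 10, 20):
        p = chance_match_tail(100, 0.05, observed)
        print(f"{observed} or more matches by chance: {p:.3g}")
```

In this toy setup a handful of matches is exactly what chance predicts, so only a count well above that baseline says anything about relatedness.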

So, my suggestions:

1. Each hypothesis should be judged based on (heuristically) how well it could possibly be determined, not "whether" it can be determined.
2. For each hypothesis space (eg, what's the ancestor of PIE), we should (time permitting and risk/reward deemed worthwhile) find the best hypothesis and consider it along with (1).

What you are describing here is "science", plain and simple. It's what everyone worth their beans already does. Like I said, you're arguing against strawmen.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 23, 2014, 01:57:20 PM
Quote
People do not point to a barrier and then attempt to give it some kind of cutoff date. I have not done that, freknu has not done that, and nobody arguing against Ruhlen has done that.
See Ringe's comments in that youtube video. (The video itself covers the positions of most people involved in this debate, so it's worth seeing, if you haven't seen it.)

And the problem is that people seem to dismiss Ruhlen's position without doing that number crunching you talk about. It's unlikely that Ruhlen's conclusions are correct. We all agree there. But it's not a very good argument to point to some unknown limit and speculate that it's probably before 100,000 years (or whatever), when we don't really know. Instead, it's much simpler to just point out that it's very hard to go that far back and therefore very unlikely that anything at 100,000 years (or whatever) is reliable.

Quote
Like I said, you're arguing against strawmen.
What's the coherent and specific argument against Ruhlen et al then?

Quote
What you are describing here is "science", plain and simple.
Indeed. So why this talk of some arbitrary date past which we can't do reconstruction? It's hinted at in this thread and it is stated explicitly by Ringe.

Quote
Once you get past a certain point, the system contains enough mutation to push the similarities into a band of probability that makes it fundamentally impossible to separate from chance.
WHAT POINT?
That's my entire objection. You're hand-waving if you can't give a date (or general window) for that point.
And, yes, that "point" is the barrier/wall/limit I've been talking about.

Quote
This is not pointing at an arbitrary barrier, as you keep insisting. It is a basic, empirical observation emerging from the character of the data itself. That's just how the math works. At a certain point, your measured reliability just drops off a cliff. This is not strictly a function of time, but in many of the world's large language families that precipice tends to sit (because of the data!) right around 6-10k y.a.
Ok! And there's a year. Good.
So:
1. Can you point to some research that defends those dates? This is one particular area that I haven't looked into.
2. Does that apply to every conceivable method of dealing with the data? Is that date not boosted by comparing reconstructions to reconstructions? (Obviously the reliability goes down over time with any method, but there need not be such a hard limit necessarily.)
3. If you claim specifically there is indeed a 6-10,000 year limit, then you are disagreeing with my point above, that we cannot locate such a limit. And that's fine. I'll gladly be wrong about that, but then we should discuss that detail rather than the bigger picture. And that's a good thing-- specifics are important here.

(I wrote a longer reply, but I truncated it after realizing the issue really is whether we can determine this 6-10,000 year limit. Let's focus on that.)
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: freknu on June 23, 2014, 11:31:37 PM
I forgot to post the link to the criticism:

What can we learn about the earliest human language by comparing languages known today? (http://www2.hawaii.edu/~lylecamp/Campbell%20OriginsProofs.pdf), Lyle Campbell

Quote
So, what can we find out or reasonably hypothesize about the earliest human language(s) from looking back from evidence in modern and attested older languages? We can speculate, perhaps even reasonably in some cases, but we can 'know' extremely little. What can we find out from lexical comparisons? Answer: essentially nothing, though we can learn object lessons from the many problems found in the methods which have been utilized to attempt to get at 'global etymologies.' Perhaps because of the assumption that all the world's languages are genetically related, descendants of 'Proto-World,' global etymologists are disposed to believe in etymological connections among words in contemporary languages, and this will to believe permits them to accept as related forms which do not exceed sheer accidental similarity as a more plausible explanation. I conclude with Bender (1993:203), ''global etymologies' are an illusion. They are an artifact of too much freedom of choice and the loss of control.' The global etymologists have not met their burden of proof. In the long time since the origin of human language(s), so much vocabulary replacement has taken place that in effect no forms once found in 'Proto-World' could have survived. Moreover, if some form had survived (and I assert it did not), after so much change it could not be recognized, and, if it should preserve a recognizable shape (and again I assert it could not), there would be so few such surviving forms that it would be impossible to distinguish successful survivors from forms similar by sheer accident. In short, the search for global etymologies is at best a waste of time, at worst an embarrassment to linguistics as a discipline, confusing and misleading those who might look to linguistics for understanding in this area.

What can we find out about Proto-World from structural comparisons? Answer: nothing especially useful, though functional typological and structural considerations may provide broad guidelines to what even the earliest human language would have to have in order to qualify as a human language. Again, though, we learn object lessons from the problems encountered in such structural comparisons. In particular, we learn that there is no correlation to be found between size of speech community or social organization and structural aspects of languages. We can speculate that the design features of human language give us a small handle on the necessary nature of the earliest human language(s), but these are so broad that essentially any linguistic structure known in any language today would qualify as possible.

Quite a biting conclusion. I'm not so sure I agree with everything he says, but I can't say I see any major issue with his conclusion, either.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 24, 2014, 03:50:04 AM
Quote
In particular, we learn that there is no correlation to be found between size of speech community or social organization and structural aspects of languages.
A lot of linguists would disagree. I'm not sure that I do. But the idea that culture and language are closely connected is certainly popular.
At least one property of small speech communities is the use of local geography in the grammatical system. That disappears with globalization (if not much sooner). And there are probably more things. I'm not sure why Campbell wrote that particular part.

Quote
We can speculate that the design features of human language give us a small handle on the necessary nature of the earliest human language(s), but these are so broad that essentially any linguistic structure known in any language today would qualify as possible.
This I agree with. (But of course, many would not-- those who believe that UG is fairly constrained, for example.)

As for the issue of reconstruction, I still want to go back to my earlier point that the best way (to whatever degree there is any chance of a good way) would be through intermediate reconstructions. Surface comparison of modern languages is silly, for the reasons outlined well above-- we'd expect differences, not strong similarities, at this point. But none of this means we absolutely can't get back that far.

Additionally, there is one interesting perspective that Campbell, I assume unintentionally, hints at: if the issue is borrowing and the messiness of data, one odd but interesting argument for a Ruhlen type approach to Proto-World is that borrowing is indeed accounted for at that level. Comparing these words across all languages, and including all changes and all borrowing, there may be some way to just average that all out and see what Proto-World was like. Overall it's still not going to happen. But at the level of Proto-World, borrowed words aren't "noise" per se, but rather just more data, with a different path.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 25, 2014, 07:29:33 PM
Quote
Once you get past a certain point, the system contains enough mutation to push the similarities into a band of probability that makes it fundamentally impossible to separate from chance.
WHAT POINT?
That's my entire objection. You're hand-waving if you can't give a date (or general window) for that point.
And, yes, that "point" is the barrier/wall/limit I've been talking about.

I know this is your entire objection, but it's the wrong objection. That's what I keep trying (but failing) to explain. There is no simple "point" in historical time beyond which reconstruction is impossible. That's just not how it works. For any particular dataset, however, there are some hard methodological limits on how much mutation can be reliably extrapolated in the absence of actual, attested datapoints. Any good textbook on historical linguistics should cover this.

When you start postulating beyond the reach of your dataset (as Ruhlen insists on doing), it becomes statistically impossible to distinguish between similarity as a consequence of relatedness and similarity as a consequence of chance. Put simply, Ruhlen is making **** up and telling people to prove him wrong. That's not science. It's not even mediocre science. It's sitting around the campfire telling just-so stories.

What's the coherent and specific argument against Ruhlen et al then?

Precisely what I've said in every post in this thread. Most historical linguists feel that it is necessary to test claims of relatedness against the null hypothesis...namely, that observed similarities between languages are the product of coincidence. Ruhlen does not do this, and when others have done it for him his hypotheses tend to collapse under the weight of analysis.
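
For readers who have not seen a test against the null hypothesis of coincidence, here is a rough sketch of a permutation test on two meaning-aligned word lists. Everything in it is an illustrative assumption: the word lists are invented strings, the similarity measure (identical first character) is deliberately crude, and the function names are hypothetical. It is not the procedure of any particular published study.

```python
import random


def initial_matches(list_a, list_b):
    """Count meaning-aligned pairs whose first character is identical
    (a deliberately crude stand-in for a real similarity measure)."""
    return sum(1 for a, b in zip(list_a, list_b) if a[:1] == b[:1])


def permutation_p_value(list_a, list_b, trials=10_000, seed=0):
    """Estimate how often meaning-scrambled pairings match at least as
    well as the real, meaning-aligned pairing."""
    rng = random.Random(seed)
    observed = initial_matches(list_a, list_b)
    shuffled = list(list_b)
    at_least_as_good = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break the link between form and meaning
        if initial_matches(list_a, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (at_least_as_good + 1) / (trials + 1)


if __name__ == "__main__":
    # Invented forms, aligned by (hypothetical) meaning slot.
    lang_a = ["pa", "ta", "ka", "ma", "na", "sa", "la", "ra"]
    lang_b = ["pe", "to", "ki", "mu", "so", "ne", "ro", "li"]
    print(permutation_p_value(lang_a, lang_b))
```

A small p-value says the aligned lists resemble each other more than scrambled ones do; a large one says the resemblance is what coincidence alone would produce.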
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 25, 2014, 07:44:41 PM
I agree with you completely, except that I see this as a gradient distinction, not a categorical one.

Quote
For any particular dataset, however, there are some hard methodological limits on how much mutation can be reliably extrapolated in the absence of actual, attested datapoints. Any good textbook on historical linguistics should cover this.
Can you be more specific?

Certainly there are limits, but I'm not sure how we know what those limits are. I can tell which argument is a better argument, but not exactly where a hard limit is.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 25, 2014, 09:29:39 PM
I agree with you completely, except that I see this as a gradient distinction, not a categorical one.

Quote
For any particular dataset, however, there are some hard methodological limits on how much mutation can be reliably extrapolated in the absence of actual, attested datapoints. Any good textbook on historical linguistics should cover this.
Can you be more specific?

Certainly there are limits, but I'm not sure how we know what those limits are. I can tell which argument is a better argument, but not exactly where a hard limit is.

If I try to throw a quarter into a cup, the probability that I will be successful decreases as the cup gets farther away from me. This decrease is a gradient change. If the cup is sitting on the moon, however, it's pretty darn categorical that I'm not going to land a quarter in it.

If you're interested in historical reconstruction, you might look at something like Anthony Fox's book. He doesn't provide much in the way of cutting-edge statistical methodology, but his historical overview is the next best thing to actually doing some of this work yourself. The material on transformation chains, in particular, should make it very clear why Ruhlen's claims sit a few million miles north of this notional "hard limit" that you keep wanting to locate.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 25, 2014, 10:57:48 PM
Quote
If I try to throw a quarter into a cup, the probability that I will be successful decreases as the cup gets farther away from me. This decrease is a gradient change. If the cup is sitting on the moon, however, it's pretty darn categorical that I'm not going to land a quarter in it.
That's exactly the sort of hand waving that I'd like to avoid. It's an unscientific argument against what you're claiming is an unscientific approach.

Again, I agree with you. This is all intuitive and Ruhlen is looking way too early. But I don't know that for scientific reasons.

Scientifically, all I know is that you almost certainly won't hit the cup. But you might.

Quote
The material on transformation chains, in particular, should make it very clear why Ruhlen's claims sit a few million miles north of this notional "hard limit" that you keep wanting to locate.
So let's, for argument's sake, assume that the hard limit is somewhere between 1,000 and 50,000 years ago, still far from the reconstructions Ruhlen proposes. How do we establish with relative certainty that the limit is indeed within that window?
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: freknu on June 26, 2014, 01:23:01 AM
Scientifically, all I know is that you almost certainly won't hit the cup. But you might.

Scientifically, there is no evidence to support the notion that he could complete said task. If you have evidence to the contrary, you are free to prove how he could do it — the burden of proof is on you.

So let's, for argument's sake, assume that the hard limit is somewhere between 1,000 and 50,000 years ago, still far from the reconstructions Ruhlen proposes. How do we establish with relative certainty that the limit is indeed within that window?

Statistics is not one of my stronger points, but here goes ...

1. attested data point
2. extrapolated data point
3. extrapolated data point, k=√(k1²+k2²)
4. extrapolated data point, k=√(k1²+k2²+k3²)
5. extrapolated data set, k=√Σki²

How long before your uncertainty is so far off the scale that you need a magnifying glass just to see it?
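
The list above can be sketched in a few lines, assuming (as the formula does) that the per-layer uncertainties are independent and combine in quadrature. The numeric values are hypothetical placeholders, and the function names are invented for this example.

```python
from math import sqrt


def combined_uncertainty(step_uncertainties):
    """Combine independent per-layer uncertainties in quadrature:
    k = sqrt(k1^2 + k2^2 + ... + kn^2)."""
    return sqrt(sum(k * k for k in step_uncertainties))


def cumulative_uncertainty(step_uncertainties):
    """Combined uncertainty after each successive layer of reconstruction."""
    return [
        combined_uncertainty(step_uncertainties[: i + 1])
        for i in range(len(step_uncertainties))
    ]


if __name__ == "__main__":
    # Hypothetical uncertainties for five stacked layers of extrapolation.
    layers = [0.2, 0.2, 0.3, 0.4, 0.5]
    for depth, k in enumerate(cumulative_uncertainty(layers), start=1):
        print(f"after layer {depth}: k = {k:.3f}")
```

The total never shrinks as layers are added. If the errors are correlated rather than independent, quadrature understates the combined uncertainty, so this is best read as an optimistic lower bound.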

There is not just a single step from modern language to PIE, it already involves decades of work and layers upon layers of reconstructions and uncertainties. Going from PIE to any hypothetical earlier superfamily is likewise going to require many layers of reconstructions, it's not going to be one single swift step.

As Campbell mentioned in his critique, the time depth of PIE is around 6,000 years, and to find any relation to Proto-Uralic (probably the nearest and most likely candidate) to form Proto-Indo-Uralic one might very well add another 6,000 years worth of change. At least PIE and PU are reconstructed from attested data points which "document" the changes over time; PIU on the other hand would be entirely reconstructed from extrapolated data points. Even further back and you have extrapolated data points from extrapolated data points from extrapolated data points ... and so on.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 26, 2014, 03:42:56 AM
To summarize my position: I'm not at all advocating Ruhlen's method. But I want to question the specific arguments against it. I think pushing these limits back is a very interesting topic, and I'm far from convinced we'll find some kind of hard limit at, for example, 10,000 years. Instead, I think it will become increasingly difficult to push it back further and further. But I see no reason why we cannot continue to push that limit back, very slowly, with continued research.

Details:
Quote
Scientifically, there is no evidence to support the notion that he could complete said task. If you have evidence to the contrary, you are free to prove how he could do it — the burden of proof is on you.
Scientifically, we only disprove (falsify) things. So the burden of proof is on proving that Ruhlen cannot do this-- otherwise it's just hand-waving, which is what I've been saying.
I don't need to "prove" that he can do what he's doing. I'm actually seeking a way to prove that he cannot, if there is indeed such a way.

Quote
How long before your uncertainty is so far off the scale that you need a magnifying glass just to see it?
Of course. But why believe anything about PIE then?
The whole point is that if you do it well enough, maybe that result you can only see with a magnifying glass is correct. So the burden of proof is on you to show why you cannot possibly, even with a magnifying glass, come to a reasonable conclusion at 100,000 years ago.
And I've said we can't: for the same reasons you showed, that it becomes statistically uninformative over time, so earlier is less information -- we'd need a magnifying glass then. But that still doesn't provide us with a cutoff between "can" and "can't" time periods.

Quote
There is not just a single step from modern language to PIE, it already involves decades of work and layers upon layers of reconstructions and uncertainties. Going from PIE to any hypothetical earlier superfamily is likewise going to require many layers of reconstructions, it's not going to be one single swift step.
I agree. And this is the weakest part of Ruhlen's approach, just skimming the surface of modern languages. It doesn't mean we "can't" go back farther, just that it would be hard to do so. It might be impossible, but this just means it is hard.

Quote
As Campbell mentioned in his critique, the time depth of PIE is around 6,000 years, and to find any relation to Proto-Uralic (probably the nearest and most likely candidate) to form Proto-Indo-Uralic one might very well add another 6,000 years worth of change. At least PIE and PU are reconstructed from attested data points which "document" the changes over time; PIU on the other hand would be entirely reconstructed from extrapolated data points. Even further back and you have extrapolated data points from extrapolated data points from extrapolated data points ... and so on.
Indeed. And what's, technically speaking, wrong with that?
It's obviously a lot less reliable than something easier, but that's why these hard questions are interesting. We might be able to come up with a good theory about these things. We might not.

To me, a very interesting question is just how far back we can push these limits. I think it's crazy to start at 100,000 years ago. But I would be very interested in exactly the question you pose: what's just before PIE? However, to answer that question we need to work with many complex details of PIE itself (and other proto-languages), some of which are still being worked out. I think it's incredibly hard but not necessarily impossible.

Another major issue here is the rate of change. While 6,000 years of PIE might show a lot of change, maybe 6,000 more years before that would not. There does seem to be evidence that intense language contact increases the rate of change, and this is exactly what happened to make IE spread out so much. Before that it may have been relatively unchanging for 6,000 years. If that is the case, then going back farther would be relatively easier, helping us to make up for some of the limited data.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: freknu on June 26, 2014, 04:12:41 AM
To summarize my position: I'm not at all advocating Ruhlen's method. But I want to question the specific arguments against it. I think pushing these limits back is a very interesting topic, and I'm far from convinced we'll find some kind of hard limit at, for example, 10,000 years. Instead, I think it will become increasingly difficult to push it back further and further. But I see no reason why we cannot continue to push that limit back, very slowly, with continued research.

No one, absolutely no one has been arguing that it is a static and immovable limit, that was the whole point of my attempt at an analogy — it is dynamic and movable. Hence according to the properties of your current set of data and understanding you approach a statistical limit where any result is pretty much indistinguishable from noise — pure chance. It is entirely dependent upon your current set of data and understanding, increase the set and the circumstances are no longer equal and the previous limits no longer apply. Hence it is always possible to push it farther and farther back, provided that your set of data and understanding allow this.

Quote
Scientifically, there is no evidence to support the notion that he could complete said task. If you have evidence to the contrary, you are free to prove how he could do it — the burden of proof is on you.
Scientifically, we only disprove (falsify) things. So the burden of proof is on proving that Ruhlen cannot do this-- otherwise it's just hand-waving, which is what I've been saying.

Que!? The burden of proof is on proving that Ruhlen's material is valid ... the bloody hell :/ It's his job to prove his claim, it's not my bloody job to prove it wrong. Not being able to disprove something does not make it correct, it only makes it unfalsifiable.

Another major issue here is the rate of change. While 6,000 years of PIE might show a lot of change, maybe 6,000 more years before that would not. There does seem to be evidence that intense language contact increases the rate of change, and this is exactly what happened to make IE spread out so much. Before that it may have been relatively unchanging for 6,000 years. If that is the case, then going back farther would be relatively easier, helping us to make up for some of the limited data.

Rate of change is certainly an intriguing question, but PIE would still have had to change considerably (as would the other families) to allow for a merging into a superfamily. It may very well be much slower than currently, but that doesn't exactly help, as it only means there was a particularly stable pre-PIE stage; it doesn't give you a get-out-of-jail-free card.

It also ignores the fact that apart from a very few select words of core vocabulary, there may simply not be any cognates available. Going from modern language to PIE you are left with a thousand or so roots (even fewer if you only count strict roots, and fewer still if you only count known reduced roots). How many of these do you think are shared by all later branches of IE? How many do you think will be left if, hypothetically, it were possible to merge a handful of families with cognates in all families?
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 26, 2014, 05:48:35 AM
Quote
No one, absolutely no one has been arguing that it is a static and immovable limit, that was the whole point of my attempt at an analogy — it is dynamic and movable. Hence according to the properties of your current set of data and understanding you approach a statistical limit where any result is pretty much indistinguishable from noise — pure chance. It is entirely dependent upon your current set of data and understanding, increase the set and the circumstances are no longer equal and the previous limits no longer apply. Hence it is always possible to push it farther and farther back, provided that your set of data and understanding allow this.
This still doesn't address then why working on PIE or Proto-PIE is acceptable, while Proto-World is not. That categorical distinction is not yet supported.

Quote
Que!? The burden of proof is on proving that Ruhlen's material is valid ... the bloody hell :/ It's his job to prove his claim, it's not my bloody job to prove it wrong. Not being able to disprove something does not make it correct, it only makes it unfalsifiable.
Of course-- you can show that it's unfalsifiable (and thus unscientific) as well. I didn't mean to suggest otherwise.
But to be fair to Ruhlen, it really isn't possible to prove it correct or prove that he can really reconstruct Proto-World. It's only possible to falsify, or to show that it is unfalsifiable.
Just because an approach seems strange doesn't mean we can dismiss it (except from a practical "I don't want to do that personally" perspective), unless we can falsify it or demonstrate unfalsifiability.

Quote
Rate of change is certainly an intriguing question, but PIE would still have had to change considerably (as would the other families) to allow for a merging into a superfamily. It may very well be much slower than currently, but that doesn't exactly help, as it only means there was a particularly stable pre-PIE stage; it doesn't give you a get-out-of-jail-free card.
That's a good point, and one of the few pieces of concrete evidence we have: we'd need to go back in time far enough and through enough changes to eliminate the diversity found in any two families to be related through these methods. Of course it's possible some variation would be eliminated via the first round of reconstruction (proto-languages for each) but there would still be a good bit to account for.

Quote
It also ignores the fact that apart from a very few select words of core vocabulary, there may simply not be any cognates available. Going from modern language to PIE you are left with a thousand or so roots (even fewer if you only count strict roots, and fewer still if you only count known reduced roots). How many of these do you think are shared by all later branches of IE? How many do you think will be left if, hypothetically, it were possible to merge a handful of families with cognates in all families?
This gets us into more interesting questions. Ruhlen et al claim that some particular roots are more likely to survive due to frequent use.
You say 1000 in PIE. That might be because 90%+ of the vocabulary disappears over 6,000 years, or it may be because those 1000 words are the sort that stick around. I'd guess a combination. There are some issues with reconstruction, such as names for particular local tree species. Therefore, those will be eliminated. But the core 1000 may be the sort of words that would potentially stay in use for 6,000 or 12,000 years because they aren't, semantically/culturally, likely to be discarded. There's still the potential for borrowing and so forth.

In the end, this may give us a better approximation of the statistics:
0 years = 10,000+ words
6,000 years = 1,000 words
12,000 years = 100 words??
18,000 years = 10 words?
25,000 years = ~1 word?

There's a lot of guesswork in that. And we'd need to come up with supporting evidence. If that's true, then the 100,000 year time depth isn't going to work at all. But if the rate of losing words changes, perhaps at 100,000 years there could still be 10 core words left. These are unclear issues, and I have no idea how we'd hope to get data to tell one way or the other.
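
The guessed figures above amount to a constant tenfold loss every 6,000 years. Here is a minimal sketch of that extrapolation; the starting size and the rate come straight from the guesses in this post, not from data, and whether a fixed rate is a sensible model at all is itself an assumption. The function name is hypothetical.

```python
def retained_words(years, start=10_000, tenfold_period=6_000):
    """Words remaining after `years`, assuming a constant tenfold loss
    every `tenfold_period` years (a guess, not an empirical rate)."""
    return start * 10 ** (-years / tenfold_period)


if __name__ == "__main__":
    for years in (0, 6_000, 12_000, 18_000, 25_000, 100_000):
        print(f"{years:>7} years: {retained_words(years):.3g} words")
```

At that fixed rate the count falls below one word somewhere between 24,000 and 25,000 years and is effectively zero at 100,000 years, which is why any case for that depth would have to rest on a much slower rate of loss.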

----
Also, importantly, I think we need to determine one more thing:
Aside from time-depth, what do we hope to do at that time-depth? I can think of a few goals:
1) Show relatedness of languages.
2) Reconstruct the proto-language.
3) Show development from proto-language to all daughter languages.

Those three tasks are increasingly difficult. At the very least I imagine we can potentially make great progress toward (1) even significantly earlier than PIE. Beyond that, it's much less likely. And for (3) we still have a lot of work to do for PIE, though chunks of (2) are already (reasonably convincingly) accomplished.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 26, 2014, 06:09:56 AM
Quote
If I try to throw a quarter into a cup, the probability that I will be successful decreases as the cup gets farther away from me. This decrease is a gradient change. If the cup is sitting on the moon, however, it's pretty darn categorical that I'm not going to land a quarter in it.
That's exactly the sort of hand waving that I'd like to avoid. It's an unscientific argument against what you're claiming is an unscientific approach.

Again, I agree with you. This is all intuitive and Ruhlen is looking way too early. But I don't know that for scientific reasons.

Scientifically, all I know is that you almost certainly won't hit the cup. But you might.

Quote
The material on transformation chains, in particular, should make it very clear why Ruhlen's claims sit a few million miles north of this notional "hard limit" that you keep wanting to locate.
So let's, for argument's sake, assume that the hard limit is somewhere between 1,000 and 50,000 years ago, still far from the reconstructions Ruhlen proposes. How do we establish with relative certainty that the limit is indeed within that window?

With the statistical entropy metrics I've been referencing throughout this thread. Some of the math depends on how much contextual phonology you're willing to bake into your model, but nothing from structuralism is going to get you much past 2-3 postulated intermediate states. Ruhlen is working with an arbitrarily large number. Alternately, you can use a purely stochastic model like the one jkpate linked to earlier for long reach but less specificity, but Ruhlen is most certainly not doing the legwork necessary for that either.

As I've said, any good textbook on historical linguistics, information theory, or plain ol' statistics will cover all this for you. I'm not sure why you're so willing to dismiss all that out of hand as "unscientific" when you seem not to know the first thing about it, but that's your choice. Perhaps you just like throwing quarters at the moon.

The rest of us, on the other hand, realize something very important: a quarter weighs 5.67 grams, a typical human's ulnar collateral ligament will break somewhere around 80 N*m, and the escape velocity of earth is 11.2 km/s.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 26, 2014, 06:22:06 AM
You say 1000 in PIE. That might be because 90%+ of the vocabulary disappears over 6,000 years, or it may be because those 1000 words are the sort that stick around. I'd guess a combination. There are some issues with reconstruction, such as names for particular local tree species. Therefore, those will be eliminated. But the core 1000 may be the sort of words that would potentially stay in use for 6,000 or 12,000 years because they aren't, semantically/culturally, likely to be discarded. There's still the potential for borrowing and so forth.

In the end, this may give us a better approximation of the statistics:
0 years = 10,000+ words
6,000 years = 1,000 words
12,000 years = 100 words??
18,000 years = 10 words?
25,000 years = ~1 word?

Yikes, no.

That's not how phonological change happens, not in the slightest. Exponential decay is precisely the wrong metaphor. It's all chain shifts, all the way down. You keep wanting to look at this in terms of x% change per y years, but that's a fundamental misunderstanding of the data and the methodology.
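
For anyone checking the arithmetic in the quoted table: those figures amount to assuming that a fixed 10% of the vocabulary survives every 6,000 years. Here is a minimal sketch of that assumption, which is the model being rejected above, not a claim about how languages actually behave. The function name and parameters are made up for illustration.

```python
# Hypothetical constant-retention model behind the quoted table:
# 10% of the vocabulary is assumed to survive each 6,000-year period.
def surviving_words(years, start=10_000, retention=0.1, period=6_000):
    """Expected number of surviving words after `years` under constant decay."""
    return start * retention ** (years / period)

for years in (0, 6_000, 12_000, 18_000, 25_000):
    print(f"{years:>6} years: about {surviving_words(years):,.1f} words")
```

The whole curve is fixed by the single `retention` figure, which is exactly the "x% change per y years" framing in dispute.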
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 26, 2014, 06:40:09 AM
Quote
Perhaps you just like throwing quarters at the moon.
Not at all. But my reasons for not doing what Ruhlen is doing are not scientific reasons: they're common sense reasons. I'd like a scientific reason to back this up.

Quote
The rest of us, on the other hand, realize something very important: a quarter weighs 5.67 grams, a typical human's ulnar collateral ligament will break somewhere around 80 N*m, and the escape velocity of earth is 11.2 km/s.
Strawman. And that's exactly the problem: in historical linguistics, we don't have any of those accurate measurements. We don't know how quickly languages change, we don't know what words are preserved and which are lost, we don't know how old language is, we don't have any way to test our predictions, and so forth.

Quote
With the statistical entropy metrics I've been referencing throughout this thread.
To be fair, hinting at. I asked for a reference. The closest you gave was "any historical linguistics textbook". I teach with one. I've read a handful. They certainly cover this in general terms, but they don't address the technical details at a scientific level, only a descriptive one. Maybe I missed it, since I wasn't looking for this when I read the books, so feel free to point me in the right direction if so.

Quote
Ruhlen is working with an arbitrarily large number. Alternately, you can use a purely stochastic model like the one jkpate linked to earlier for long reach but less specificity, but Ruhlen is most certainly not doing the legwork necessary for that either.
Here I'll certainly agree-- Ruhlen is taking every liberty possible and then making some arbitrary guesses. That's a problem. But aside from throwing it out just for that, I am genuinely wondering what can be shown scientifically (unlike his methodology) to demonstrate that he's wrong to even attempt such a task. So far I haven't seen that argument.

As for jkpate's suggestion, it's interesting, but I specifically ignored that because it can't be the optimal methodology, in theory. So any weaknesses are irrelevant if there may be a better approach.

Quote
That's not how phonological change happens, not in the slightest. Exponential decay is precisely the wrong metaphor. It's all chain shifts, all the way down. You keep wanting to look at this in terms of x% change per y years, but that's a fundamental misunderstanding of the data and the methodology.
My comment was directed at freknu in terms of lexical statistics, that is, how many words are reconstructed for PIE. I wasn't referring to phonological changes, at least not directly. And this also appears to be the explanation given by Ringe in the video I linked to earlier: borrowing (and other kinds of lexical change) is responsible for the time-depth barrier, not (just) grammatical/phonological change.

Quote
Some of the math depends on how much contextual phonology you're willing to bake into your model, but nothing from structuralism is going to get you much past 2-3 postulated intermediate states.
Ok, and with these two comments I think we're getting somewhere: you're talking specifically about phonology, and the argument is within that context, rather than the big picture.
Extending that and rephrasing it to make sure I know what you mean...

As we go back in time and layer sound changes, they become undecidable and opaque from our perspective. Sounds change from A to B then back to A, so that from a flat (synchronic) perspective, the diachrony is flat as well, hiding all of the important changes from us. By having a small (say <50) inventory of phonemes in an average language, and by having many changes over time, we necessarily run into such problems. The result is that, much like "flattening" an image in Photoshop, we can't reconstruct the layers without all of that missing information. We're left with a narrow view.
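
A toy illustration of this flattening, as a minimal sketch only: the sound changes and proto-forms below are invented, and real changes are context-sensitive in ways a plain string replacement ignores. It shows the closely related problem of mergers, where several distinct earlier forms end up as the same later form, so the later form alone cannot tell us which one it came from.

```python
from collections import defaultdict

# Hypothetical ordered sound changes, applied to every word in turn.
# Rules and proto-forms are invented for illustration only.
SOUND_CHANGES = [
    ("p", "f"),  # stage 1: p > f
    ("f", "h"),  # stage 2: f > h (original f and the new f fall together)
    ("h", ""),   # stage 3: h is lost
    ("e", "a"),  # stage 4: e > a (merges with original a)
]

def apply_changes(word, changes=SOUND_CHANGES):
    """Run a proto-form through each sound change in order."""
    for old, new in changes:
        word = word.replace(old, new)
    return word

proto_forms = ["pata", "fata", "hata", "ata", "peta", "tama"]

# Group the proto-forms by the later form (reflex) they end up as.
reflexes = defaultdict(list)
for form in proto_forms:
    reflexes[apply_changes(form)].append(form)

for reflex, sources in reflexes.items():
    print(f"{reflex!r} could descend from any of {sources}")
```

With only four changes, five of the six invented proto-forms collapse into one reflex. Evidence from sister languages that did not undergo the same mergers is what lets the comparative method pull some of them apart again.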

That's an interesting perspective, and likely one that is easier to defend (statistically) than some others. The trouble is that we still don't have all of the necessary data to know how quickly sound changes occur and so forth. So we don't have a baseline or a control, and we can't really know for sure how far back we can look.

Further, there's the interesting point made (in the video, for example) that some words seem to actually look like what they used to be in PIE, at least in some cases. Selecting arbitrary examples (as in the video) isn't too convincing, but the possibility is intriguing. I believe it was freknu who posted some evidence about similarities in agreement inflections in IE and Uralic, and maybe that's the sort of place we'd find evidence.

In the end, we all agree Ruhlen is looking too far back. I just want to know the best way to show that :)
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: jkpate on June 26, 2014, 08:30:12 AM
To me, a very interesting question is just how far back we can push these limits. I think it's crazy to start at 100,000 years ago. But I would be very interested in exactly the question you pose: what's just before PIE? However, to answer that question we need to work with many complex details of PIE itself (and other proto-languages), some of which are still being worked out. I think it's incredibly hard but not necessarily impossible.

The issue is not that the information about very ancient languages is hard to get, it's that the information just isn't there. To a first approximation we can view language change as a Markov chain; a language at time t randomly changes into a language at time t+1, according to some transition kernel. The Markov chain of language change may be ergodic, which essentially means that any language can eventually turn into any other language with non-zero probability. If it is ergodic, then, for large enough n, the language at time t is asymptotically statistically independent of the language at time t+n. In information-theoretic terms, the reconstruction task is sensitive to the conditional entropy of the language at time t, given the language at time t+n, which in turn is the entropy of both languages minus the entropy of the later, known language:

$H(L_{t} | L_{t+n} ) = H( L_{t} , L_{t+n} ) - H( L_{t+n} )$

However, if n is large enough that the languages are statistically independent, then we have

$H( L_{t} , L_{t+n} ) = H( L_{t} ) + H( L_{t+n} )$

So by substitution:

$H(L_{t} | L_{t+n} ) = H( L_{t} ) + H( L_{t+n} ) - H( L_{t+n} ) = H( L_{t} )$

That is, our uncertainty about the language to be reconstructed is the same, whether or not we know the later language. Our reconstruction does not depend on the evidence. Hard work can give us a better idea of what the transition kernel is (this is essentially the panchronic phonology that MalFet mentioned), but that does not address the core issue, which is the disappearance of information about earlier states. Information about the transition kernel is hard to get. Information about the state at time t is gone.

The worry djr33 raises may be implemented here by worrying about just how large n needs to be, and how to translate this into years. That's still a valid worry; as far as I know, there's no way to derive how large n needs to be (and there's a huge incentive to be able to do this for general ergodic chains). I just wanted to distinguish this case, where information is gone, from the case where information is hard to get.
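
The derivation above can be checked numerically on a toy chain. This is a minimal sketch under loud assumptions: three made-up "language states", an invented transition kernel, and a uniform distribution over the earlier state. None of these numbers come from real linguistic data. It computes $H(L_{t} | L_{t+n})$ as joint entropy minus the entropy of the later state, and shows it climbing toward $H(L_{t})$ as n grows.

```python
import math

# Toy transition kernel over three hypothetical "language states".
# Row i holds the one-step probabilities of moving from state i.
KERNEL = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
]
PRIOR = [1 / 3, 1 / 3, 1 / 3]  # assumed distribution over the earlier language L_t

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def n_step_kernel(kernel, n):
    """Return the n-step transition probabilities (kernel raised to the power n)."""
    size = len(kernel)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, kernel)
    return result

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(prior, kernel, n):
    """H(L_t | L_{t+n}) = H(L_t, L_{t+n}) - H(L_{t+n}), in bits."""
    size = len(prior)
    step = n_step_kernel(kernel, n)
    joint = [[prior[i] * step[i][j] for j in range(size)] for i in range(size)]
    later = [sum(joint[i][j] for i in range(size)) for j in range(size)]
    return entropy([p for row in joint for p in row]) - entropy(later)

for n in (1, 5, 20, 100):
    print(f"n = {n:>3}: H(L_t | L_t+n) = {conditional_entropy(PRIOR, KERNEL, n):.4f} bits")
print(f"H(L_t) = {entropy(PRIOR):.4f} bits")
```

How fast the gap closes depends entirely on the kernel, which is the open question about how large n has to be. The sketch only shows that the gap does close.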
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 26, 2014, 09:55:30 AM
Quote
The rest of us, on the other hand, realize something very important: a quarter weighs 5.67 grams, a typical human's ulnar collateral ligament will break somewhere around 80 N*m, and the escape velocity of earth is 11.2 km/s.
Strawman. And that's exactly the problem: in historical linguistics, we don't have any of those accurate measurements. We don't know how quickly languages change, we don't know what words are preserved and which are lost, we don't know how old language is, we don't have any way to test our predictions, and so forth.

We don't need to know how quickly language changes. We don't need to know which words are preserved and which are lost. We don't need to know how old language is. That's just not how reconstruction works because (and this is the key point!) that's not how language works.

I'm just repeating myself here, and I don't know how else to say it. jkpate provides an excellent summary of the problem in information theoretic terms. The way we do historical reconstruction is by postulating intermediate nodes in a Markov transformation chain and then comparing the resulting power of explanation against the possibility that the observed similarities came about by chance. Thanks to the techniques that jkpate talks about, we've been able to quantify this increasingly well over the last few decades (with the lovely consequence that we can now trace fainter lineages than ever before), but the core principles at stake here are as old as the hills. Heck, it was in exactly these terms that Saussure got this whole field started with his postulated (and later vindicated) laryngeal consonants.

A critique of Ruhlen in scientific terms is simply that his conclusions are not the product of a scientific methodology. He's not, in short, doing what the bolded sentence above requires. Instead, he's gazing at a bowl of tea leaves and telling the world what he sees. He simply does not have access to the information necessary to make the claims he's making.
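
To make the "came about by chance" half of that comparison concrete, here is a minimal sketch of a permutation test on two word lists aligned by meaning. It is a generic statistical technique swapped in for illustration, not the specific procedure anyone in this thread has described. The word lists are invented, and matching on the first segment alone is a deliberately crude stand-in for real correspondence sets.

```python
import random

# Invented toy word lists, aligned by meaning slot; not real language data.
LANG_A = ["pata", "miru", "kona", "tesi", "lumo", "sari", "nebu", "wiko"]
LANG_B = ["pada", "mila", "kuna", "tezi", "rumo", "hali", "nepu", "giko"]

def matches(words_a, words_b):
    """Count meaning slots whose two words begin with the same segment."""
    return sum(a[0] == b[0] for a, b in zip(words_a, words_b))

def chance_p_value(words_a, words_b, trials=10_000, seed=0):
    """Share of random re-pairings that match at least as well as the real pairing."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = matches(words_a, words_b)
    shuffled = list(words_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break the meaning alignment
        if matches(words_a, shuffled) >= observed:
            hits += 1
    return observed, hits / trials

observed, p_value = chance_p_value(LANG_A, LANG_B)
print(f"observed matches: {observed}, share of shuffles doing as well: {p_value}")
```

A small share means the aligned lists resemble each other more than arbitrary pairings do. It says nothing yet about inheritance versus borrowing, and it covers only the chance baseline, not the postulated intermediate stages.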

Quote
With the statistical entropy metrics I've been referencing throughout this thread.
To be fair, hinting at. I asked for a reference. The closest you gave was "any historical linguistics textbook".

Nonsense. I pointed you towards Juliette Blevins and George van Driem, and told you very specifically that Anthony Fox's work on chain shifts in the history of reconstruction should fill in a critical piece that you seem to be missing. I'm happy to provide more, but first I'd have to know why those didn't work for you as starting places. Until then, if you haven't done the work yourself, haven't read the literature, and won't take my word for it, I'm not really sure what else there is to say. I can't divine this all from first principles for you.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 26, 2014, 10:31:22 AM
The worry djr33 raises may be implemented here by worrying about just how large n needs to be, and how to translate this into years. That's still a valid worry; as far as I know, there's no way to derive how large n needs to be (and there's a huge incentive to be able to do this for general ergodic chains). I just wanted to distinguish this case, where information is gone, from the case where information is hard to get.

Indeed, and that's a very good point. Given how quickly languages have been shown to adopt novel phonetic exponents for even relatively constant phonological structures, I can't see any reason why n wouldn't be (at least in theory) remarkably small...much smaller than the time-depth associated with PIE. Of course, in historical reality languages do not change at a maximal or even steady rate, so we're not going to be able to speculate on the quality of our information a priori. We're going to have to derive that from the data itself, with no small appeal to Ockham's Razor.

For the time being, at least, there's no question that the bulk of PIE lies scattered across that penumbra of information recoverability. There are some things we know with near certainty, some things we will probably never know, and then a great many things in between. In that space between, researchers argue vociferously about what can and cannot be reliably seen. Since Ruhlen has no coherent methodology to speak of, it was easy enough for him to walk right past all this troublesome stuff and into his magical city in the clouds.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: freknu on June 26, 2014, 10:39:20 AM
Statistics be damned! That's me off the trolley D:
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: Daniel on June 26, 2014, 07:44:56 PM
Quote from: jkpate
The worry djr33 raises may be implemented here by worrying about just how large n needs to be, and how to translate this into years. That's still a valid worry; as far as I know, there's no way to derive how large n needs to be (and there's a huge incentive to be able to do this for general ergodic chains). I just wanted to distinguish this case, where information is gone, from the case where information is hard to get.
Thanks :) Informative post.
But as you say, we still don't really know exactly when this information disappears. It seems to be sometime before PIE and yet after the period Ruhlen is trying to reconstruct. But we have no particular evidence for that. It seems obvious, but supporting that, I still maintain, is challenging.

Quote
We don't need to know how quickly language changes. We don't need to know which words are preserved and which are lost. We don't need to know how old language is. That's just not how reconstruction works because (and this is the key point!) that's not how language works.
Huh?
If Pre-PIE changed at a rate of 1% of the speed of post-PIE, then surely we could reconstruct, with a little effort, languages ten or twenty thousand years earlier than that. Speed and such are crucial here.
Obviously there is some speed at which change occurs, and there is some corresponding cutoff point where too much information is lost. But again, we can't be certain, due to having no evidence, that the cutoff point is after the time Ruhlen is working with.

Quote
I'm just repeating myself here, and I don't know how else to say it. jkpate provides an excellent summary of the problem in information theoretic terms. The way we do historical reconstruction is by postulating intermediate nodes in a Markov transformation chain and then comparing the resulting power of explanation against the possibility that the observed similarities came about by chance. Thanks to the techniques that jkpate talks about, we've been able to quantify this increasingly well over the last few decades (with the lovely consequence that we can now trace fainter lineages than ever before), but the core principles at stake here are as old as the hills. Heck, it was in exactly these terms that Saussure got this whole field started with his postulated (and later vindicated) laryngeal consonants.
Fine. And you have not given any indication whether a cutoff point is at 5,000, 20,000 or 50,000 years ago.
Therefore, whether or not Ruhlen is doing it correctly, we might be able to reconstruct languages that are 100,000 years old. It's highly implausible, but there's no concrete evidence against that.

Quote
A critique of Ruhlen in scientific terms is simply that his conclusions are not the product of a scientific methodology. He's not, in short, doing what the bolded sentence above requires. Instead, he's gazing at a bowl of tea leaves and telling the world what he sees. He simply does not have access to the information necessary to make the claims he's making.
This I agree with. And this has nothing to do with him doing it "too early" or any of the major complaints I've seen against him. It's just that he's doing it the wrong way. I think almost everyone can agree on that: his methodology is not rigorous and skips steps.

I think I've confused the matter by not making it clear that I'm talking about possible time depths for comparison, not whether Ruhlen is actually using the best methodology. (Still, Ruhlen's methodology is just fine for showing that, say, English and Swedish are related. So it still would be nice to have a metric to show just how far back that methodology can go.)

Quote
Nonsense. I pointed you towards Juliette Blevins and George van Driem, and told you very specifically that Anthony Fox's work on chain shifts in the history of reconstruction should fill in a critical piece that you seem to be missing. I'm happy to provide more, but first I'd have to know why those didn't work for you as starting places.
Hm, ok. My fault-- you did mention a couple names. I'll check out their CVs when I have some time.
Title: Re: Bengtson & Ruhlen, Global Etymology
Post by: MalFet on June 26, 2014, 08:57:14 PM
Fine. And you have not given any indication whether a cutoff point is at 5,000, 20,000 or 50,000 years ago.

This is where I either start pulling my hair out or let go of this thread. I hope I have the good sense to do the latter.