Bengtson & Ruhlen, Global Etymology

I'm not really sure what you're understanding as "noise" here. In a standard cross-linguistic corpus analysis, the challenge is simply to separate similarities due to relatedness from similarities due to chance, as conditioned by phonological patterns. There are many ways to go about this, but most techniques involve predicting a baseline of similarity expected between unrelated languages and then measuring forward against that standard. With this information, can you know the relatedness of two particular lexical items? Usually not very well. Can you know the relatedness of two large lexical sets? Yes, to a high degree of probability.
All I'm saying is that we never really can know. Finding a suitable baseline is challenging because the comparison of any two languages is not identical to the comparison of others. Certainly we can find a reasonable range and realize that it's hard to compare/reconstruct beyond, say, 15,000 years. But we can't rule out the possibility that there are tiny bits of relevant evidence still there. As I said, it's exponentially less precise, but that doesn't mean there's precisely no information or that there is some "wall" we can't go past. It just means that there's less and less reason to try the farther back we go. That's all I've been saying. This "wall" is a myth, just as much as the mass comparison method. Statistically speaking, it's more correct, but only because it approximates the real situation, not because it really exists.
There's no reason at all we couldn't find some case where we can compare languages 25,000 years old. The odds are against it, and we may never know if that's due to chance or not, but there probably are some cases out there. Our time is, however, likely better spent on other projects.

If your standard is absolute knowledge (whatever that means!), you'll have to find a new field. In science, everything we know is subject to revision in the face of newer and better evidence. That caveat comes pre-baked into how scientific epistemologies use the word "know", and there's nothing particularly unusual about historical linguistics in this regard.
I completely agree. So why did you write "we absolutely can know"? I don't think we can. I think it's always possible that an ancient comparison is correct, but just incredibly unlikely. The only way to know if we're going back "too far" is based on how reliable we want to be. Taken that way, older comparisons aren't so bad, as long as they aren't presented as fact or even reliable guesses.

Further, this goes back to what I said earlier: if we compare reconstructions to reconstructions, that should (probabilistically speaking) reduce some of the noise. Borrowing/contact, for example, would be partially eliminated if we use a well-reconstructed version of PIE and compare it to a well-reconstructed version of Proto-Uralic. This should boost how far back we can go. I suppose in this case the limit would be simply based on available information. Languages aren't infinite in a reconstruction sense-- we rely on data points like lexical items and perhaps syntactic constructions. So we reconstruct partial languages ("core vocabulary"), and would then have less to work with, and less still after a secondary reconstruction. (And of course overall it would be less reliable, which is not to say entirely irrelevant.) I'd be interested in seeing more of this and less number crunching based on surface data.
If your standard is absolute knowledge (whatever that means!), you'll have to find a new field. In science, everything we know is subject to revision in the face of newer and better evidence. That caveat comes pre-baked into how scientific epistemologies use the word "know", and there's nothing particularly unusual about historical linguistics in this regard.
I completely agree. So why did you write "we absolutely can know"?

I doubt he meant "it is possible to absolutely know" but rather "it is absolutely possible to know".

One problem is that the SNR is not known. So the probable SNR decreases with time depth, but we have no way to know exactly how. Overall, yes, what you said. But we don't ever know when it's too far.

Not true. Definitionally, you never know whether a particular piece of data is signal or noise. That's why it's noise. But, any serious examination of language relatedness (though, natch, not Ruhlen's) will quantify the degree of entropy in the system, and thus we absolutely can know when we've gone back too far.

Hmm, when we talk about quantifying the degree of entropy in the system, I think about explicit probabilistic models that allow us to compute various entropies and get numbers in bits. While there is a bit of work in this direction, it's far from standard in historical linguistics as I understand it. Is this the kind of work that you are referring to when you talk about quantifying the degree of entropy in the system?
I doubt he meant "it is possible to absolutely know" but rather "it is absolutely possible to know".
Ah, probably. But still, how is it then certain that we can know, when we don't know much of anything in historical linguistics?

jkpate, agreed

In the end, we're all saying basically the same thing, though in different ways: it's crazy to try to reconstruct after, say, 20,000 years, and probably a bad idea even after 10,000. But, no, I still don't see any particular wall.

I don't agree that we don't know much of anything. I'd say we know things with a greater degree of uncertainty than in other fields, even if we usually don't quantify that degree of uncertainty.
One problem is that the SNR is not known. So the probable SNR decreases with time depth, but we have no way to know exactly how. Overall, yes, what you said. But we don't ever know when it's too far.

Not true. Definitionally, you never know whether a particular piece of data is signal or noise. That's why it's noise. But, any serious examination of language relatedness (though, natch, not Ruhlen's) will quantify the degree of entropy in the system, and thus we absolutely can know when we've gone back too far.

Hmm, when we talk about quantifying the degree of entropy in the system, I think about explicit probabilistic models that allow us to compute various entropies and get numbers in bits. While there is a bit of work in this direction, it's far from standard in historical linguistics as I understand it. Is this the kind of work that you are referring to when you talk about quantifying the degree of entropy in the system?

Definitely. Though it's not quite the standard of practice I'd like it to be, work along these lines isn't all that rare either. Most work in panchronic phonology (which, I think it's now fair to say, has won the war for hearts and minds among phonologists, even if the rest of linguistics hasn't yet jumped on board) moves in these directions. I'm thinking in particular of work by Juliette Blevins and George van Driem.

In the end, we're all saying basically the same thing, though in different ways: it's crazy to try to reconstruct after, say, 20,000 years, and probably a bad idea even after 10,000. But, no, I still don't see any particular wall.

Can you find me a linguist who advocates for this "wall" you speak of? You're arguing against a strawman.

You, in this thread. And others I've heard quoted against the Ruhlen et al position.

Virtually all serious work in historical linguistics takes it as axiomatic that there's a hard horizon beyond which we won't be able to reconstruct proto-languages reliably.
you statistically cannot separate any possible valid relations from the noise — making comparative linguistics impossible past a certain point, the event horizon
As djr mentioned, exponential — but at some point you still reach a situation where the SNR !>1, and you can now stare yourself blind at the event horizon without making any further progress.
But, any serious examination of language relatedness (though, natch, not Ruhlen's) will quantify the degree of entropy in the system, and thus we absolutely can know when we've gone back too far.

In some sense, there may be such a barrier, but we cannot know precisely where it is (5,000 years? 10,000 years? 20,000 years? more?), and this all depends on the available data, so that, actually in theory, we could track the languages back farther in time given more data. So there is no barrier per se, except as a limit to what we have available to us (data), and we can never actually know where that barrier is. We can make some pretty reasonable guesses about what to do and what not to do, but it's an unscientific position to point to a barrier and then attempt to give it some kind of cutoff date. As I've been saying, it's just less and less likely to be productive. There's a certain point where, like playing the lottery against great odds, most linguists will give up. Anyone who keeps playing must be aware of the odds against them. So if we take Ruhlen's work in this sense (whether or not he does himself) then it's a lot more reasonable: it's the best prediction of what might have been the case many millennia ago with the caveat that it's almost certainly wrong.

The real problem with this position is that we know almost nothing about PIE with any real certainty. I do think something (which we now call PIE) existed, somewhere, at some time. I think most linguists would agree to that -- the IE languages are related. Beyond that, basics like reconstructed words or the homeland are constantly debated. And it's highly arbitrary to take PIE as the poster case for reconstruction, when it also happens to be the most researched.

So we know little about PIE with certainty, we know less about earlier relationships with less certainty as well. But there's no particular cutoff point, at least not one that is accessible to us.

All of the data is constantly cycling, so that by the time of PIE much of it isn't of any use for reconstruction for us. So there's a barrier of sorts for PIE too-- at least for many lexical items, etc. Possibly even the homeland. But we don't know exactly what. And likewise, going back further, there are probably _some_ things we can do with reasonable accuracy, even if we cannot know for certain whether we're doing them accurately.

My entire point, and I do think this is important, is that we should not talk about barriers but about probabilities. At 100,000 years the probability of the best hypothesis being correct may be (let's say arbitrarily) 1%. And that's that. It's no better or worse than 1%. It is what it is. At the time of PIE, maybe it's something like 60% (again, arbitrarily picking a number).

So from a practical perspective, what is the most useful way to spend our time? Probably at a shallower time depth.

But I also think these details are important for one specific reason: rather than actually being critical of the methodology used by Ruhlen et al, the most common argument is "nah, can't do that!". The bigger problem is the way they're trying to reconstruct ancient relationships. Just comparing surface data is a terrible idea. And if there is any way to go back further I'm certain it's not just with surface data. It needs to be incremental with intermediate reconstructions (along with all of the uncertainty added in that process). While still far from certain, that's going to be more likely to lead us to relevant conclusions about earlier families.

So, my suggestions:

1. Each hypothesis should be judged based on (heuristically) how well it could possibly be determined, not "whether" it can be determined.
2. For each hypothesis space (eg, what's the ancestor of PIE), we should (time permitting and risk/reward deemed worthwhile) find the best hypothesis and consider it along with (1).

I think you're taking it a bit too literally. Let's see if I can't pull off a somewhat decent practical analogy ...

Event horizon
Loosely put, and to continue with the black hole analogy, (the distance to) the statistical event horizon is (directly) related to the mass (of information):

$\varepsilon \propto m$

The more information you have the farther away the event horizon moves — it is not a static and immovable concept.

Visible horizon
Likewise, and to continue with the horizon analogy, (the distance to) the statistical visible horizon is (directly) related to the height (of understanding):

$\eta \propto h$

The more understanding you have the farther away the visible horizon moves — it is not a static and immovable concept.

Thus it is not a simple function of time, $f(t)$, but rather, $f(\varepsilon, \eta) \propto (m, h)$, which may or may not directly correspond to any explicit scale or depth of time. You are focusing too greatly on an explicit time depth, when neither I nor malfet have even hinted at any such thing. Scale or depth of time may very well be nestled deep somewhere in the equation, but it is not a simple one-to-one correspondence, and neither malfet nor I have hinted at such a simple relation.

E.g. tell me, using NOTHING but the words "four" (English) and "quattuor" (Latin), can you show me that they are related?
which may or may not directly correspond to any explicit scale or depth of time
Exactly. And this means that Ruhlen's work might be correct. It's just very unlikely.

Overall, I believe we agree.

But I do object to the phrasing that we can object to Ruhlen's work because it's "too early" or anything like that. More reasonably, it's almost certainly too early. I realize that sounds like a minor objection, but I think it's important to effectively show the problems by not stating the objections hyperbolically.

For the record, here's a good documentary (a bit dated, but still relevant, and with all the big names in it) on the subject:

As for an explicit mention of a "limit", see this part:
http://youtu.be/J0phq7litTc?t=31m5s

Overall, Ringe is more correct than Ruhlen. But I still object to Ringe's phrasing.

which may or may not directly correspond to any explicit scale or depth of time
Exactly. And this means that Ruhlen's work might be correct. It's just very unlikely.

Anything might be correct, anything might be incorrect — however, what matters is what you can demonstrate; hence the event horizon.

The difference is that while the cosmic event horizon is static and immovable due to the properties of light, the statistical (or perhaps epistemological) event horizon is dynamic and movable due to the properties of knowledge.

I would dare say that at this point in time we do not have the knowledge necessary to move the event horizon back far enough to be able to utilise comparative language to such a degree — that might change in the future, but that doesn't remove the event horizon.

Therefore, what today brings us beyond the event horizon and into the indistinguishable sea of noise, might not be the case tomorrow; but tomorrow comes tomorrow.
So there may be an event horizon, but we don't know where it is, so there's no point in operating based on that. Rather, we operate based on the probability that what we do (in research) may be useful or insightful. So earlier is better. As a heuristic 5,000 years is ok, while 100,000 is not. But that's not because we've identified a limit at 10,000-12,000 years as Ringe claims for example.

I would say it's similar to our thermoception: the closer we get the stronger the sensation/awareness; but there is no solid barrier to touch so it feels like a continuously growing gradient, burning ever hotter.
You, in this thread. And others I've heard quoted against the Ruhlen et al position.

Virtually all serious work in historical linguistics takes it as axiomatic that there's a hard horizon beyond which we won't be able to reconstruct proto-languages reliably.
you statistically cannot separate any possible valid relations from the noise — making comparative linguistics impossible past a certain point, the event horizon
As djr mentioned, exponential — but at some point you still reach a situation where the SNR !>1, and you can now stare yourself blind at the event horizon without making any further progress.
But, any serious examination of language relatedness (though, natch, not Ruhlen's) will quantify the degree of entropy in the system, and thus we absolutely can know when we've gone back too far.

In some sense, there may be such a barrier, but we cannot know precisely where it is (5,000 years? 10,000 years? 20,000 years? more?), and this all depends on the available data, so that, actually in theory, we could track the languages back farther in time given more data. So there is no barrier per se, except as a limit to what we have available to us (data), and we can never actually know where that barrier is. We can make some pretty reasonable guesses about what to do and what not to do, but it's an unscientific position to point to a barrier and then attempt to give it some kind of cutoff date.

People do not point to a barrier and then attempt to give it some kind of cutoff date. I have not done that, freknu has not done that, and nobody arguing against Ruhlen has done that. That's just not what's happening here. If you're interested in this stuff, I'd encourage you to do some of the actual number-crunching yourself sometime. As it stands, however, you seem to be missing some fundamentals and as a consequence are misunderstanding what I (and everyone else) are actually saying.

As with many statistical phenomena, the certainty of historical reconstructions does not scale with the quality of available data in a linear way. This is a very important fact. If your data is 50% as good as some baseline, your reconstructions will be substantially *less* than 50% as reliable as the baseline's. When you start building probabilistic reconstructions from probabilistic reconstructions from probabilistic reconstructions, you approach randomness not by steps but by leaps.
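To put rough numbers on that compounding, here is a minimal sketch in Python. It assumes, purely for illustration, that each layer of reconstruction is independently correct with the same probability; the function name `chained_reliability` and the 0.8 and 0.4 figures are hypothetical and not drawn from any study.

```python
def chained_reliability(per_stage: float, stages: int) -> float:
    """Chance that every stage in a chain of reconstructions is right.

    Assumption (not from the thread): stages are independent and
    equally reliable, so the probabilities simply multiply.
    """
    if not 0.0 <= per_stage <= 1.0:
        raise ValueError("per_stage must be a probability in [0, 1]")
    if stages < 0:
        raise ValueError("stages must be non-negative")
    return per_stage ** stages


# Reliability after stacking 1 to 4 reconstructions on top of each other.
for stages in range(1, 5):
    good = chained_reliability(0.8, stages)   # hypothetical baseline data
    poor = chained_reliability(0.4, stages)   # data "50% as good" per stage
    print(stages, round(good, 3), round(poor, 3), round(poor / good, 3))
```

Under these assumptions, halving the per-stage reliability leaves only one eighth of the baseline's reliability after three stacked stages, which is the "leaps, not steps" behaviour described above. Real reconstructions are not independent stages, so treat this as an intuition pump rather than a model.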

In other words, proper stochastic reconstructions *begin* with the understanding that a certain extent of apparent concordance will be present in every comparison by mere random chance, and there are tried-and-true ways of quantifying this extent of chance. Once you get past a certain point, the system contains enough mutation to push the similarities into a band of probability that makes it fundamentally impossible to separate from chance. This is not pointing at an arbitrary barrier, as you keep insisting. It is a basic, empirical observation emerging from the character of the data itself. That's just how the math works. At a certain point, your measured reliability just drops off a cliff. This is not strictly a function of time, but in many of the world's large language families that precipice tends to sit (because of the data!) right around 6-10k y.a.
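One common way to quantify "this extent of chance" is a permutation test: shuffle which word goes with which meaning in one list, and see how often the shuffled lists look at least as similar as the real ones. The sketch below is a toy version in Python. The word lists are invented, and "same initial letter" is a deliberately crude, hypothetical stand-in for a real phonological similarity measure.

```python
import random


def match_count(list_a, list_b):
    """Count meaning slots where both words start with the same letter.

    A crude, hypothetical similarity measure; words must be non-empty.
    """
    return sum(1 for a, b in zip(list_a, list_b) if a[0] == b[0])


def permutation_baseline(list_a, list_b, trials=2000, seed=0):
    """Return (observed matches, estimated p-value under shuffling).

    Shuffling list_b breaks the meaning alignment, so the shuffled
    scores estimate how much similarity arises by chance alone.
    """
    if len(list_a) != len(list_b):
        raise ValueError("word lists must cover the same meaning slots")
    rng = random.Random(seed)
    observed = match_count(list_a, list_b)
    shuffled = list(list_b)
    at_least = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if match_count(list_a, shuffled) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (trials + 1)


# Invented word lists, aligned by meaning slot (not real languages).
words_a = ["pata", "tiko", "kumi", "masa", "nalu", "siro", "lomo", "wena"]
words_b = ["pito", "tane", "kora", "mule", "nesi", "sabu", "lika", "wopu"]
words_c = ["mofa", "suli", "paro", "teki", "wina", "kelo", "nuta", "lasi"]

print(permutation_baseline(words_a, words_b))  # many matches, small p
print(permutation_baseline(words_a, words_c))  # no matches, p of 1.0
```

The point of the design is that the baseline comes from the data themselves: a pair of lists only counts as evidence of relatedness if its score clearly beats what the shuffled versions produce. As mutation accumulates, the observed score sinks into the shuffled distribution and the p-value stops being small.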

1. Each hypothesis should be judged based on (heuristically) how well it could possibly be determined, not "whether" it can be determined.
2. For each hypothesis space (eg, what's the ancestor of PIE), we should (time permitting and risk/reward deemed worthwhile) find the best hypothesis and consider it along with (1).

What you are describing here is "science", plain and simple. It's what everyone worth their beans already does. Like I said, you're arguing against strawmen.

People do not point to a barrier and then attempt to give it some kind of cutoff date. I have not done that, freknu has not done that, and nobody arguing against Ruhlen has done that.
See Ringe's comments in that youtube video. (The video itself covers the positions of most people involved in this debate, so it's worth seeing, if you haven't seen it.)

And the problem is that people seem to dismiss Ruhlen's position without doing that number crunching you talk about. It's unlikely that Ruhlen's conclusions are correct. We all agree there. But it's not a very good argument to point to some unknown limit and speculate that it's probably before 100,000 years (or whatever), when we don't really know. Instead, it's much simpler to just point out that it's very hard to go that far back and therefore very unlikely that anything at 100,000 years (or whatever) is reliable.

Like I said, you're arguing against strawmen.
What's the coherent and specific argument against Ruhlen et al then?

What you are describing here is "science", plain and simple.
Indeed. So why this talk of some arbitrary date past which we can't do reconstruction? It's hinted at in this thread and it is stated explicitly by Ringe.

Once you get past a certain point, the system contains enough mutation to push the similarities into a band of probability that makes it fundamentally impossible to separate from chance.
WHAT POINT?
That's my entire objection. You're hand-waving if you can't give a date (or general window) for that point.
And, yes, that "point" is the barrier/wall/limit I've been talking about.

This is not pointing at an arbitrary barrier, as you keep insisting. It is a basic, empirical observation emerging from the character of the data itself. That's just how the math works. At a certain point, your measured reliability just drops off a cliff. This is not strictly a function of time, but in many of the world's large language families that precipice tends to sit (because of the data!) right around 6-10k y.a.
Ok! And there's a year. Good.
So:
1. Can you point to some research that defends those dates? This is one particular area that I haven't looked into.
2. Does that apply to every conceivable method of dealing with the data? Is that date not boosted by comparing reconstructions to reconstructions? (Obviously the reliability goes down over time with any method, but there need not be such a hard limit necessarily.)
3. If you claim specifically there is indeed a 6-10,000 year limit, then you are disagreeing with my point above, that we cannot locate such a limit. And that's fine. I'll gladly be wrong about that, but then we should discuss that detail rather than the bigger picture. And that's a good thing-- specifics are important here.

