Author Topic: Recommendation on Diachronic Corpus  (Read 7850 times)

Offline Jennifer Shen

  • New Linguist
  • *
  • Posts: 1
Recommendation on Diachronic Corpus
« on: December 13, 2018, 11:19:54 PM »
I am a graduate student in linguistics, and my thesis focuses on language change. I am currently looking for a diachronic corpus ranging from the 1600s to the 1900s. Any recommendations would be much appreciated. Thank you.

Offline Daniel

  • Administrator
  • Experienced Linguist
  • *****
  • Posts: 2073
  • Country: us
    • English
Re: Recommendation on Diachronic Corpus
« Reply #1 on: December 13, 2018, 04:57:57 PM »
For English? (Although it may be more interesting to look at other languages, the resources will be more limited than what I describe below.)

There are a number of possibilities, and there will be a tradeoff between quality and quantity.

One of the largest corpora is Google Ngrams, based on data from Google Books, but the search tools are limited, it isn't full text, and once you get back before 1800 the data isn't as good. It's especially limited for the 1500s and somewhat better after 1600, but there is still much less data from those years than from later ones. That means you will have potentially unrepresentative data points in the earlier years (and therefore much more noise in the graphs). Still, even that limited data may be more than you'll find in some other corpora, though it isn't balanced with the rest.

There are some nice corpora offered at BYU:
They strike a good balance between quality, features, and ease of use, though with some limitations (you can't download the full text for all of them, you have a limited number of queries per day depending on your access type after [free] registration, etc.). COHA is very nice for American English 1800-2000, with better (but less) data than Google Ngrams.

A few other specialty corpora there are also helpful:
The Hansard Corpus is especially nice because it represents spoken British English from parliamentary proceedings since 1803, which is really unique. Of course the language is formal, but I've found it useful.
Similar, but written and even more formal, is the new American Supreme Court corpus, which starts in 1790.
The Time Magazine corpus is also interesting, though it only covers the 1900s.

EEBO (Early English Books Online) is now available through there (also search online for "EEBO-TCP" for several other interfaces), and it's a good source for earlier material, better than Google Ngrams for that time, with full text available (at least through some interfaces with login). But it covers the mid-1400s through the 1600s, so it only overlaps with about a century of the period you're looking for. (Also watch out for variant spellings in the corpus: you'll need to do a lot of manual work to get [and interpret!] the best data, but for that time period it's very useful.)
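To give a concrete sense of the variant-spelling problem: a few mechanical substitutions catch some common Early Modern English patterns. This is only an illustrative sketch (the function name and rules are my own examples, not part of any EEBO interface); serious work uses dedicated normalization tools such as VARD plus manual checking.

```python
import re

# Illustrative sketch of a few common Early Modern English spelling
# patterns found in EEBO-era texts. These rules are examples only;
# real normalization needs dedicated tools and manual review.
def modernize(word: str) -> str:
    w = word.lower()
    w = w.replace("ſ", "s")     # long s: "ſhall" -> "shall"
    w = w.replace("vv", "w")    # double-v: "vvhen" -> "when"
    # initial v before a consonant was often modern u: "vpon" -> "upon"
    w = re.sub(r"^v(?=[^aeiou])", "u", w)
    # medial u between vowels was often modern v: "loue" -> "love"
    w = re.sub(r"(?<=[aeiou])u(?=[aeiou])", "v", w)
    return w

print(modernize("vpon"), modernize("loue"), modernize("ſhall"))
# -> upon love shall
```

Even rules like these will both overcorrect and miss cases, which is why the manual checking mentioned above is unavoidable.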

So as you can see, having a single corpus from 1600s-1900s is going to be difficult, although you could try to compare some features across several corpora to cover the full range. If you can pick just a subset of those years, I would personally strongly suggest the Hansard corpus because it is spoken language, and represents 200 years, so there's a lot to work with.

There are also some smaller corpora that might fit your requirements, perhaps even for almost the full period you're considering.

For example, the Corpus of Late Modern English Texts (CLMET, in several editions, e.g. 3.0) is much smaller than some of the options above, but it's also arguably better balanced, and if you don't need a huge amount of text (either you're looking for relatively common features, or you're planning to look at each example manually and so couldn't handle a lot of data anyway), then something like that (it's just one example) might be good for you:
Of course for something like that you're probably going to be getting data mostly from published books, though sometimes you'll find personal correspondence (letters) if you want something more natural. (There's also the question of whether you'd prefer probably very formal non-fiction, or possibly unrepresentative but colloquial fiction, e.g., for examples of dialogue.)

There are also various specialty corpora such as full collections of all of Shakespeare's works, but those won't easily generalize to the full 400-year period.

Those are just some examples from my own experience. There are other options as well, even making your own corpus from books you find online (anything about 100 years old or older is likely accessible online for free from a combination of sources like Google Books, Project Gutenberg, etc.). For example, a simple scenario would be to choose one similar novel from each century and compare them, but there are much more complex ways to do it too.

Something else I have seen is using an existing database of collected examples, such as those found in the Oxford English Dictionary (searchable online with a subscription, probably through your university). That can work and offers a wide range of texts throughout the history of English, although the selection of examples in the OED is biased for illustrative purposes rather than giving a balanced picture of what English was like. And don't assume they've really found the earliest examples for any word in those entries-- the OED is a huge project and therefore of limited accuracy for any individual word if you're most interested in when usage changed. It's good, but not the definitive answer on anything. (I saw a compelling conference abstract about how the OED has biased many research results in this way, with authors concluding a usage is later than it really is; if you look into the details yourself, as I have for my own work, you often find earlier examples.)
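The simple scenario of comparing one similar novel per century can be sketched in a few lines. The filenames below are hypothetical placeholders for plain-text books you'd download yourself (e.g. from Project Gutenberg), and the search word is just an example:

```python
import re
from collections import Counter

# Sketch of comparing word frequency across a small self-made corpus,
# one plain-text file per period. Filenames are hypothetical.
def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

def per_million(tokens: list[str], word: str) -> float:
    return 1_000_000 * Counter(tokens)[word] / len(tokens)

# usage (hypothetical files, one novel per century):
# for period, path in [("1700s", "novel_1700s.txt"),
#                      ("1800s", "novel_1800s.txt")]:
#     with open(path, encoding="utf-8") as f:
#         tokens = tokenize(f.read())
#     print(period, round(per_million(tokens, "whilst"), 1))
```

Normalizing to a per-million-words rate (rather than raw counts) is what makes texts of different lengths comparable.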

Something else to watch out for: I have personally gotten the impression that it's easy to conclude that whatever phenomenon you're looking at "starts" near the beginning of your corpus, just because the frequency is increasing there. Be very careful about making such generalizations. (From my own research I know of a published paper, otherwise methodologically strong, that claims something began in the 1800s because it seemed to be increasing at the beginning of that period, but more recent research has shown it was already found in the 1500s.) It's very easy to get that impression, and I wonder exactly why this is, but watch out for it, especially for the time periods you're talking about.

In summary, there are a lot of sources, but you'll need to find what works for you. Finding a single good source, even if it wasn't exactly what you had planned, might be motivation enough to reframe your research questions to fit (for example, using the Hansard corpus for the 1800s-1900s, rather than starting in the 1600s with a mix of corpora). Corpora also differ widely in the genres they represent, so watch out for that both when selecting a corpus in general and especially if you mix corpora to cover a larger timeframe. If you must do that, then the most consistent source (but not necessarily the best data) will be published books.

The other consideration is your technical skills: if you can write enough code to search, organize, and compile results from plain text, then you might be best off making your own corpus from texts available online. If you can't do that, then you should rely on some of the easy-to-use options (some of which are mentioned above) with built-in search functions, etc. A related question is how you will search the data: do you need a tagged corpus (with part of speech and other features), or do you want plain text? Are you looking at syntax? Morphology? That can affect what kind of corpus you need, and also how much data you need, depending on the frequency of the phenomenon in question. An easy benchmark is to pick a corpus of Modern English (maybe COCA or the BNC, or just Google Ngrams) and do a basic search to see how many results you get per million words-- that will give you an idea of the smallest corpus size you can reasonably work with.
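That benchmark is simple arithmetic, but it's worth writing down. All the numbers below are invented for illustration:

```python
# Back-of-the-envelope corpus sizing from a per-million-words rate.
# Every number here is made up for illustration.
def rate_per_million(hits: int, corpus_words: int) -> float:
    return 1_000_000 * hits / corpus_words

def min_corpus_words(rate_pmw: float, target_examples: int) -> int:
    """Smallest corpus (in words) expected to yield target_examples."""
    return round(target_examples * 1_000_000 / rate_pmw)

# e.g. 250 hits in a 100-million-word modern corpus:
rate = rate_per_million(250, 100_000_000)   # 2.5 per million words
# to collect ~500 examples you'd then want roughly:
print(min_corpus_words(rate, 500))          # -> 200000000 (200 million words)
```

Of course this assumes the phenomenon was roughly as frequent in earlier periods as it is now, which for language change is exactly what's in question, so treat the estimate as a lower bound.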
« Last Edit: December 13, 2018, 04:59:30 PM by Daniel »
Welcome to Linguist Forum! If you have any questions, please ask.