Historical Linguistics / Re: Romance languages not descended from Latin.
December 16, 2018, 02:02:38 AM
That addresses some relevant details, but overall, setting aside terminological preferences, I don't think it changes the most general points, as indicated in the conclusion of the article. (I just briefly skimmed it though.)
Historical Linguistics / Re: Romance languages not descended from Latin.
December 15, 2018, 03:39:23 PM
I came across this article:
Language-specific analysis / Re: What language is this?
December 14, 2018, 10:18:58 AM
Well, the speech rhythm seems all wrong for a Celtic language, but that gives you something to test. You're likely to run into the problem that you faced with the Finnish person you talked to, that individuals may have attenuated ability to recognize a related language. While a Maltese speaker would most likely recognize that a speaker of another Maltese dialect is speaking a "related language", I doubt they would recognize (or be recognized by) a speaker of Gulf Arabic or Amharic.
Computational Linguistics / Re: Recommendation on Diachronic Corpus
December 13, 2018, 04:57:57 PM
For English? (Although it may be more interesting to look at other languages, the resources will be more limited than what I describe below.)

There are a number of possibilities, and there will be a tradeoff between quality and quantity.

One of the largest corpora is Google Ngrams based on data from Google Books, but the search tools are limited and it isn't full text, and once you get back before 1800 the data isn't quite as good. It's especially limited during the 1500s, and somewhat better after 1600, but still much less data then than later. What that means is that you will have potentially unrepresentative data points in the earlier years (and therefore much more noise in the graphs). Still, even that limited data may be more than you find in some other corpora, but it isn't balanced with the rest.

There are some nice corpora offered at BYU:
They have a good balance between quality, features, and ease of use, though with some limitations (you can't download the full text for all of them, you have limited queries per day depending on your access type after [free] registration, etc.). COHA is very nice for American English 1800-2000, better (but less) data than Google Ngrams.

A few other specialty corpora there are also helpful:
The Hansard Corpus is especially nice, because it represents spoken British English from parliamentary proceedings since 1803, so that's really unique. Of course the language is formal, but I've found it useful.
Similar, but written and even more formal, is the new American Supreme Court corpus, from 1790.
The Time Magazine corpus is also interesting, though only the 1900s.

EEBO (Early English Books Online) is now available through there (also search online for "EEBO-TCP" for several other interfaces), and that's a good source for earlier material, better than Google Ngrams for the time, and full text is available (at least through some interfaces with login). But it's mid-1400s through 1600s, so it only covers about a century of the time period you're looking for. (Watch out for variant spellings in the corpus, so you'll need to do a lot of manual work to get [and interpret!] the best data, but for that time period it's very useful.)
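
To give a rough idea of what handling variant spellings can look like if you have the full text, here is a minimal Python sketch. The variant lists, the `VARIANTS` name, and the sample sentence are my own illustrative assumptions (not taken from EEBO or any of its interfaces); in practice you would compile the lists by hand from your own data.

```python
import re

# Hypothetical variant lists for illustration only; a real study would
# compile these manually from the corpus itself.
VARIANTS = {
    "love": ["love", "loue"],
    "upon": ["upon", "vpon", "uppon", "vppon"],
}

def variant_regex(headword):
    # Whole-word, case-insensitive match on any listed spelling.
    alternatives = "|".join(re.escape(v) for v in VARIANTS[headword])
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

def count_variants(text, headword):
    # Count every occurrence of any listed spelling of the headword.
    return len(variant_regex(headword).findall(text))

# Invented sample sentence, not a quotation from any corpus.
sample = "Vpon my word, I loue thee; upon my life, I love thee."
print(count_variants(sample, "upon"))
print(count_variants(sample, "love"))
```

A search for only the modern spelling would miss half of the hits in that sample, which is the kind of undercounting to watch for.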

So as you can see, having a single corpus from 1600s-1900s is going to be difficult, although you could try to compare some features across several corpora to cover the full range. If you can pick just a subset of those years, I would personally strongly suggest the Hansard corpus because it is spoken language, and represents 200 years, so there's a lot to work with.

There are also some smaller corpora that might fit your requirements and maybe for almost the full period you're considering.

For example, Corpus of Late Modern English Texts (CLMET, in several editions, e.g. 3.0) is much smaller than some of the options above, but it's also maybe better balanced, and if you don't need a huge amount of text (either you're looking for relatively common features, or you're planning to look at each example manually so you can't handle a lot of data anyway), then something like that (it's just one example) might be good for you:
Of course for something like that you're probably going to be mostly getting data from published books, though sometimes you'll find some personal correspondence (letters) if you want something more natural. (There's also the question of whether you'd prefer probably very formal non-fiction, or possibly unrepresentative but colloquial fiction, e.g., for examples of dialog.)

There are also various specialty corpora such as full collections of all of Shakespeare's works, but those won't easily generalize to the full 400-year period.

Those are just some examples from my own experience. There are some other options as well, even making your own corpus from books you find online (anything about 100 years ago or older is likely accessible online for free from a combination of sources like Google Books, Project Gutenberg, etc.). For example, a simple scenario would be to choose one similar novel from each century and compare them, but there are much more complex ways to do it too. Something else I have seen is using an existing database of collected examples such as those found in the Oxford English Dictionary (searchable online with a subscription, probably through your university). That can work and offers a wide range of texts throughout the history of English, although the selection of examples in the OED is biased for illustrative purposes rather than a real balanced picture of what English was like. And don't assume they've really found the earliest examples for any words in those entries-- the OED is a huge project and therefore of limited accuracy for any individual word if you're most interested in when usage changed. It's good, but not the definitive answer on anything. (I saw a compelling conference abstract about how the OED has biased many research results in this way, with authors thinking something is later than it really was if you look into the details yourself, and I've done the same for my own work.)

Something else to watch out for is that I have personally gotten the impression that it's easy to think whatever phenomenon you're looking at "starts" near the beginning of your corpus, when the frequency is increasing. Be very careful about making such generalizations. (From my own research I know of a published paper, which otherwise appears strong methodologically, but claims something began in the 1800s because it seemed to be increasing at the beginning of that period, but then more recent research has shown it was found starting in the 1500s.) It's very easy to get that impression, and I wonder exactly why this is, but watch out for it, especially for the time periods you're talking about.

In summary, there are a lot of sources, but you'll need to find what works for you. Finding a single good source, even if it wasn't exactly what you had planned, might be motivation enough to reframe your research questions to fit (for example, using the Hansard corpus for 1800s-1900s, rather than starting in the 1600s with a mix of corpora). Corpora also have very different genres represented, so watch out for that both in selecting them in general and also especially if you mix them to look at a larger timeframe. If you must do that, then the most consistent source (but not necessarily best data) will be published books.

The other consideration is your technical skills: if you can write enough code to search, organize and compile the results from plain text, then you might be best making your own corpus from texts available online. If you can't do that, then you should rely on some of the easy-to-use options (some of which are mentioned above) with automatic search functions, etc. The other question is how you will search the data: do you need a tagged corpus (with part of speech and other features) or do you want plain text? Are you looking at syntax? Morphology? That can have an impact on what kind of corpus you need. And also how much data you need, depending on the frequency of the phenomenon in question. An easy benchmark is to pick a corpus of Modern English (maybe COCA or BNC or just Google Ngrams) and then do a basic search to see how many results you get per million words-- that will give you an idea of the smallest reasonable size you can work with.
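
If you do go the plain-text route, the per-million-words benchmark is easy to compute yourself. Here is a minimal Python sketch, assuming your texts are already loaded as strings and grouped by period. The `texts_by_period` dictionary, the toy sentences, and the crude tokenizer are all placeholders of mine, not part of any corpus tool mentioned above.

```python
import re

def tokenize(text):
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def per_million(texts_by_period, pattern):
    """Return {period: (hits, total_tokens, hits per million tokens)}."""
    regex = re.compile(pattern)
    results = {}
    for period, texts in texts_by_period.items():
        tokens = [tok for text in texts for tok in tokenize(text)]
        # A hit is a token that matches the pattern in full.
        hits = sum(1 for tok in tokens if regex.fullmatch(tok))
        rate = hits / len(tokens) * 1_000_000 if tokens else 0.0
        results[period] = (hits, len(tokens), rate)
    return results

# Toy data, invented for illustration only.
texts_by_period = {
    "1600s": ["Thou art come and thou shalt stay"],
    "1800s": ["You are come and you shall stay here"],
}

for period, (hits, total, rate) in per_million(texts_by_period, r"thou").items():
    print(period, hits, total, round(rate))
```

With samples this tiny the rates are meaningless, of course. The useful part is the other direction: if a feature turns up only a few times per million words in a modern corpus, a historical corpus of a few hundred thousand words will probably give you too few examples to say anything.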
Computational Linguistics / Recommendation on Diachronic Corpus
December 13, 2018, 11:19:54 PM
I am a graduate student of linguistics and my thesis focuses on language change. Currently, I am looking for a diachronic corpus ranging from 1600s to 1900s. Your recommendation will be much appreciated. Thank you.
Language-specific analysis / Re: What language is this?
December 13, 2018, 08:39:44 PM
Belated thanks again. My Finnish friend didn't recognize anything. But what about Cornish or a relative thereof? It sounds to my untrained ear a bit like this woman:
Morphosyntax / Re: Spelling -ing form
December 12, 2018, 02:33:17 AM
Thank you for your explanation.
Morphosyntax / Re: Spelling -ing form
December 10, 2018, 06:09:00 PM
That's not exactly the pattern. The pattern is that short vowels tend to be followed by double consonants. So "siting" would be pronounced like "citing" or "sighting", while "sitting" maintains the short vowel. Most monosyllabic verbs have short vowels, so that's the general pattern.

What you'd need to test this is a pair of verbs with a short/long contrast. For example, "tow" (long, rhymes with "toe") and "bow", although that's really a diphthong so it would be treated like a vowel I guess. Of course "towing" would not have doubling, but then also "bowing" does not, so it's ambiguous in pronunciation between long and short vowels. But of course English often is ambiguous in spelling anyway, so that's no surprise.

The explanation, if there is one, comes from historical reasons: the doubling of consonants is an artifact of much earlier usage (probably going back all the way to Old English) representing syllable patterns. Compare Italian where doubled consonants are pronounced differently, and therefore correspond also to different vowel pronunciations (consonant clusters also were an indication of syllable weight in Latin). So this really isn't a rule as much as an accidental pattern based on old usage, which then happened to generalize a bit because it was useful. Doubled consonants in English aren't pronounced differently, but they're like the opposite of a "final silent -e" marking long vowels in the previous syllable.

Of course for W in particular the explanation is in the name: it was originally, literally a double-U, a sort of in-between consonant/vowel letter (like Y). The result is that it doesn't double like other letters, plus the lack of many times when it would obviously need to be doubled. Lack of existing examples is one way that a spelling rule won't spread.

Regardless, if you Google "bowwing" you'll find several websites correcting the spelling, suggesting that some English speakers do try to extend the pattern to those words, probably especially when the vowel is short.

A-vowels in English are especially weird, because they have three possible pronunciations: "bat" /æ/, "draw" /a/, and "late" /e:/. Your example of "draw" would make sense as "drawwing", but I'm guessing that's not how it works specifically because of the confusion of the three forms of A not fitting a simple long/short distinction.

In the end, any English "spelling rules" aren't rules at all, because English spelling doesn't follow rules, just vague patterns, and there are always exceptions. Some studies have shown that Chinese learners of English do very well with English spelling by memorizing many common word forms, rather than learning these patterns. In other words, treating English spelling as arbitrarily as Chinese characters works well for them because they're used to the memorization strategy. Other learners, and native speakers too, have trouble when they try to follow the "rules" because they just don't work all the time. Native speakers therefore have some general patterns learned as probable rules, but also memorize many exceptions. Sometimes it has to do with particular letters, like W just not being doubled.
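
To make the "general pattern plus exceptions" idea concrete, here is a minimal Python sketch of the doubling pattern for one-syllable verbs. It is only a heuristic of my own, not a real rule: it works on letters rather than sounds, the function name `ing_form` is made up, and the set of never-doubled letters includes W (as discussed above) plus X and Y, which I've added on the same grounds.

```python
VOWELS = set("aeiou")
NEVER_DOUBLED = set("wxy")  # assumption: letters this pattern never doubles

def ing_form(verb):
    """Naive -ing spelling for a one-syllable verb; a heuristic, not a rule."""
    v = verb.lower()
    # Letter-based consonant-vowel-consonant check on the last three letters.
    ends_cvc = (
        len(v) >= 3
        and v[-1] not in VOWELS
        and v[-2] in VOWELS
        and v[-3] not in VOWELS
    )
    if ends_cvc and v[-1] not in NEVER_DOUBLED:
        return v + v[-1] + "ing"      # sit -> sitting
    if v.endswith("e") and not v.endswith("ee") and len(v) > 2:
        return v[:-1] + "ing"         # cite -> citing
    return v + "ing"                  # draw -> drawing, read -> reading

for verb in ["sit", "cite", "draw", "grow", "bow", "read"]:
    print(verb, "->", ing_form(verb))
```

Even this small sketch needs a special case for particular letters, and it still gets some verbs wrong: "quit" comes out as "quiting" because the U after Q is counted as a vowel letter. That is exactly the kind of exception that has to be memorized.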

Remember, English spelling was standardized about 500 years ago (following the invention of the printing press), and aside from some minor changes (like differences in British and American spelling of "color"), it hasn't shifted since. At the same time, at the beginning the Great Vowel Shift substantially changed how vowels were pronounced, then for several hundred years pronunciation continued to change too, and the result is a big mess that doesn't follow "rules", often not even patterns. English spelling is etymological rather than logical. English spelling reform is a whole different topic, which has often been a popular idea but never a popular action, and there are some problems with it, especially two big ones: (1) then we wouldn't be able to read most of the internet, or Shakespeare, unless we also taught the "old" spelling; and (2) surprisingly, almost all spelling contrasts, no matter how odd, are pronounced differently in some dialect of English, so any changes would start to collapse those to a standard pronunciation. There's also the question of what it would look like: spelling English in IPA (for example) just looks wrong, and very un-English. So for now, lots of memorization...
Morphosyntax / Spelling -ing form
December 10, 2018, 01:04:53 PM
According to the grammar rule, in one-syllable verbs ending in consonant-vowel-consonant we double the last consonant, as in sit > sitting.

Why don't we double the consonant in verbs like "draw" /drɔː/, "grow" /ɡrəʊ/, etc.? Is it because we must look at the sound, not the written letter?
Full Title: Conference on Asian Linguistic Anthropology 2020
Short Title: The CALA 2020

Date: 05-Feb-2020 - 08-Feb-2020
Location: Bintulu, Sarawak, Malaysia

Contact Person:
     Assoc. Prof. Dr. Hazlina Abdul Halim
     Head, Dept. of Foreign Languages
     Faculty of Modern Languages & Communication
     Universiti Putra Malaysia

Linguistic Field(s): Anthropological Linguistics; General Linguistics

Language Family(ies): Afroasiatic; Altaic; Austro-Asiatic; Austronesian; Indo-European; Japanese Family; Latin Subgroup; Sino-Tibetan

Call Deadline: 09-Apr-2019

CFP Description:

Following the success of the CALA 2019, The Conference on Asian Linguistic Anthropology 2019, in Cambodia, we announce The CALA 2020, February 5-8, 2020, at The University Putra Malaysia, Bintulu, Sarawak, Malaysia.

Purpose and Structure - The CALA 2020 invites Linguists, Anthropologists, Linguistic and Cultural Anthropologists, Culturologists, Sociologists, Political Scientists, and those in related fields pertinent to Asia.

Details - The University Putra Malaysia, Bintulu, Sarawak, Malaysia, February 5-8, 2020

- The American Anthropological Association (Official Partner)
- Taylor and Francis Global Publishers (Official Publishing Partner)
- 60 academic institutions globally (Nanyang Technological University, University of Hawai'i, Temple University, University College London, SOAS, Hong Kong Polytechnic University, Indian Institute of Anthropologists, and so forth).
- Scientific Committee of over 100 academics globally prominent in Linguistic Anthropology and related fields

Theme - Themed Asian Text, Global Context, The CALA 2020 will represent over 300 years of East-West global interaction, communication, and transnationalism. Throughout, symbolisms of Asian 'texts' have been significantly emphasized, (re)interpreted, contested, and distorted, while employed for cultural and political purposes. Asian texts have become highly representational, authenticating, and legitimizing sociopolitical and cultural devices, and their potency should not be underestimated. Never have these texts shown more significance than in the present, as their intensified use, and their qualities in Asian identities long contested, seek this Linguistic Anthropological exploration.

Call for Papers:

Publications - We advise that several Special Journal issues are planned, as well as a collection of Monographs, from papers submitted to the CALA, that meet the requirements of submission, review and acceptance. The papers selected will all be published with Top-Tier Ranking journals, and their Publishers. Here, ample assistance will be provided to revise manuscripts for publication.

Presentation lengths:
Submitters must plan around the following:
- Colloquia - 1.5 hours with 3-5 contributors (Parts A and B are possible, thus 6-10 contributors)
- General paper sessions - Approx. 20-25 minutes each, which includes 5 minutes for questions/responses
- Posters - to be displayed at designated times throughout the CALA

Abstract and poster proposals should address one or more of the key strands related to Asian countries and regions:

– Anthropological Linguistics
– Applied Sociolinguistics
– Buddhist studies and discourses
– Cognitive Anthropology and Language
– Critical Linguistic Anthropology
– Ethnographical Language Work
– Ethnography of Communication
– General Sociolinguistics
– Islamic Studies and discourses
– Language, Community, Ethnicity
– Language Contact and Change
– Language, Dialect, Sociolect, Genre
– Language Documentation
– Language, Gender, Sexuality
– Language Ideologies
– Language Minorities and Majorities
– Language Revitalization
– Language in Real and Virtual Spaces
– Language Socialization
– Language and Spatiotemporal Frames
– Multifunctionality
– Narrative and Metanarrative
– Nonverbal Semiotics
– Poetics
– Post-Structuralism and Language
– Semiotics and Semiology
– Social Psychology of Language
– Textualization, Contextualization, Entextualization

Abstract submissions - The Call for Abstracts is now open, at the below links, which contain all pertinent information.

Anthropological Excursion - Bintulu, Sarawak, Malaysia

See the conference website for the full CFP and all information.
