Author Topic: Counting the most frequently used words in everyday life  (Read 1551 times)

Counting the most frequently used words in everyday life
« on: July 14, 2015, 05:28:50 AM »
Can anyone please explain the methodology for measuring the most frequently used words in everyday life? I need to know the 100 most used words in English. Thank you.

Re: Counting the most frequently used words in everyday life
« Reply #1 on: July 14, 2015, 02:06:25 PM »
It depends on how realistic you want the figures to be, and what they are supposed to represent. The basic method is to acquire a large random corpus covering "everyday life", and then count how many times each word occurs. You would need to somehow define this concept of "everyday life"; for example, casual chit-chat about somebody's funeral, or a huge storm, would not be "everyday life" because huge storms and funerals are uncommon events. I assume you want to exclude certain topics -- what exactly (or, why)? Also, you have to at least think about the legitimacy of data so gathered, because not every string of English words out there is generated by a speaker of English. A corpus survey of Somali will probably have a minimal volume of spurious data since few people use Somali words if they aren't actually speakers of Somali, but a lot of people use English words even though they don't actually speak English.

You need to make some decisions about which words, if any, aren't counted (e.g. "the", "in"), and whether you want to count exact word forms ("cat" is one word, "cats" is another) or merge sing/sang/sung into one "word". From the opposite POV, would every instance of "stick" count as one word? There's an intransitive verbal sense meaning "adhere", a transitive one meaning "place", and a noun meaning "piece of wood". I would count those as three words (which happen to be spelled the same).

[Some further clarification: a terminological distinction is made between "word" and "lemma", where "runs, running, ran..." have a common lemma, but they are separate words. Also, if you can dispense with the "everyday life" topic requirement, there is a frequency list that includes words without apparent exclusions (so "the" is apparently the most frequent word), but with lemmatization (reduction of the various inflected forms of "be" to one word). This is probably about what you would get if you excluded fancy academic publications and pieces of refined literature.]
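
To make the word-form versus lemma choice concrete, here is a rough Python sketch. The LEMMAS table is a made-up, hand-written stand-in for a real lemmatizer, and the function names are just for illustration. It does not handle the "stick" problem at all; separating same-spelled words would need part-of-speech or sense tagging on top of this.

```python
from collections import Counter

# Hypothetical, hand-written lemma table for illustration only; a real
# project would use a proper lemmatizer or an already lemmatized corpus.
LEMMAS = {
    "cats": "cat",
    "sang": "sing",
    "sung": "sing",
    "runs": "run",
    "running": "run",
    "ran": "run",
}

def count_forms(tokens):
    """Count exact word forms: "cat" and "cats" stay separate."""
    return Counter(tokens)

def count_lemmas(tokens, lemmas=LEMMAS):
    """Count lemmas: inflected forms are merged under one entry."""
    return Counter(lemmas.get(tok, tok) for tok in tokens)

tokens = "the cat sang and the cats sing what the cat sung".split()
forms = count_forms(tokens)    # "cat" and "cats" are counted separately
merged = count_lemmas(tokens)  # "cats" is folded into "cat", "sang"/"sung" into "sing"
```

The two counters can give different rankings from the same text, which is why the decision has to be made before counting.
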
Re: Counting the most frequently used words in everyday life
« Reply #2 on: July 15, 2015, 04:35:10 PM »
The post above is a good answer. Here's my (probably oversimplified) slightly different (but mostly similar) version:

1. Gather a corpus.
(As with any statistical method, this is a sample rather than the whole data set. So you want to gather the best sample possible: large, contextually appropriate, well-transcribed, etc.)

2. Count the number of tokens of each word. Rank them. Make a list.
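
The two steps above could look something like this in Python. It is only a minimal sketch: the tokenizer is a crude regular expression of my own (lowercased letters, with internal apostrophes allowed), and tokenize, top_words and the exclude parameter are names I made up for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract runs of letters, allowing
    internal apostrophes so that "don't" stays one token."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def top_words(text, n=100, exclude=frozenset()):
    """Return the n most frequent words as (word, count) pairs,
    skipping any word listed in exclude."""
    counts = Counter(tok for tok in tokenize(text) if tok not in exclude)
    return counts.most_common(n)

sample = "The cat sat on the mat. The dog sat by the door, and the cat left."
ranked = top_words(sample, n=3)
without_the = top_words(sample, n=3, exclude={"the"})
```

With a real corpus you would read the text from files instead of a string; the counting and ranking stay the same.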

This has been studied a great deal, and word frequencies follow a Zipfian distribution: the most common words are very common and the least common words are very uncommon.

In other words, for measuring the MOST common words, you don't need a very large data set. (The most common word is "the".)

For measuring the LEAST common words, you need a much larger data set. Actually this is where it gets very difficult and where problems exist in corpus linguistics. It's not hard to measure common things accurately but very hard to measure infrequent things accurately.
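
A quick way to see this in your own data is to check how much of the text the top few words cover, and how many words turn up only once. This is an illustrative sketch with function names of my own choosing, not a standard measure from any particular toolkit.

```python
from collections import Counter

def coverage(counts, k):
    """Fraction of all tokens accounted for by the k most frequent words."""
    total = sum(counts.values())
    top = sum(c for _, c in counts.most_common(k))
    return top / total if total else 0.0

def singletons(counts):
    """Words seen exactly once; a single sighting says very little
    about how frequent the word really is."""
    return sorted(w for w, c in counts.items() if c == 1)

counts = Counter("the cat sat on the mat and the dog sat by the door".split())
share_of_top_word = coverage(counts, 1)  # "the" alone: 4 of 13 tokens
seen_once = singletons(counts)           # 7 of the 9 distinct words
```

Even in this toy sentence one word covers almost a third of the tokens while most words occur once, so the top of the list stabilizes quickly and the bottom does not.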

A good test would also be to use several different sources. Does casual writing have the same results as conversational transcriptions? If the results match, you can be confident. If they do not, you can try to figure out why.
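
One simple way to run that comparison is to take the top-n list from each source and see how much they overlap. The sketch below uses the Jaccard overlap of the two sets, which is just one reasonable choice among several; the function names and the tiny token lists are made up for the example.

```python
from collections import Counter

def top_set(tokens, n):
    """The set of the n most frequent words in a token list."""
    return {w for w, _ in Counter(tokens).most_common(n)}

def overlap(tokens_a, tokens_b, n=100):
    """Jaccard overlap between the two top-n word sets (1.0 = identical)."""
    a, b = top_set(tokens_a, n), top_set(tokens_b, n)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def compare_top(tokens_a, tokens_b, n=100):
    """Split the top-n words into shared ones and those unique to each source."""
    a, b = top_set(tokens_a, n), top_set(tokens_b, n)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}

# Tiny stand-ins for a written and a spoken sample.
written = "the the the of of and".split()
spoken = "the the the you you and".split()
score = overlap(written, spoken, n=2)
detail = compare_top(written, spoken, n=2)
```

The words that land in only one list are the place to start when trying to figure out why the sources disagree.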

This website has a lot of information and is associated with corpora designed and maintained by Mark Davies, one of the leading researchers in this field:

That's based mostly on written text, though. There are corpora of spoken English, if you want to do that.

One problem with your question is that you do not define "everyday life". It seems obvious, but you actually need to decide what this means. Spoken? Written? Native speakers only or everyone who speaks English? Where? Why? What topics? Frequency varies by discourse style, topic, participants, emotion, etc.
