Topic: Estimating word class size. Why cannot it be done?

Offline norweger

  • New Linguist
  • *
  • Posts: 2
Estimating word class size. Why cannot it be done?
« on: January 04, 2014, 09:54:34 PM »
There are about 600.000 words in the Oxford Dictionary, and those are the words I am talking about here. There are those who claim it's nearly impossible to make even a very rough estimate of how big one English word class is compared with another, because many words belong to several classes. It's easy to agree that words like «love» can be both a noun and a verb, and that clouds the picture, but the word can't be a conjunction or a preposition, so it's not complete chaos.

If we for example take fifteen words, my opinion is that we can put them in classes, and it can make sense although the lines are sometimes blurred. If we can do it with fifteen, why would it be impossible to do the same with large samples?

Actor: noun
Scarecrow: noun
Obey: verb
Gather: verb, noun
Glamorous: adjective
Helpful: adjective
Better: verb, adverb, adjective, noun
Into: preposition
With: preposition
To: preposition, adverb
Yet: adverb, conjunction
You: pronoun
Their: determiner
Underneath: preposition, adverb
Wow: exclamation, verb, noun

Judging from this sample, small as it is, it would seem that nouns are a very large class; adverbs, verbs, and prepositions are large classes; adjectives are a big class; and pronouns, exclamations, determiners and conjunctions are minuscule classes.

At least etymologically speaking, it should be possible to trace a word back to its original meaning, and thereby rank the word classes correctly after size – or can it? What's your opinion?

Would you care to make a guesstimate as to how many of the 600.000 words pertain to each word class in the dictionary?

Offline Daniel

  • Administrator
  • Experienced Linguist
  • *****
  • Posts: 1840
  • Country: us
    • English
Re: Estimating word class size. Why cannot it be done?
« Reply #1 on: January 04, 2014, 10:21:03 PM »
There are a lot of layers to this, and we should discuss some assumptions. For example, an obvious one is that there are exactly (or about) 8 word classes in English and that words "belong" to them. If you do make that assumption and you have criteria for deciding, it's very easy to do this, if you have the time for all 600,000 words. But why would you want to do that? Can you, by categorizing all of the words, accomplish something? That's why we'd need to look a little deeper.

Quote
and it can make sense although the lines are sometimes blurred.
That's part of the problem. You can't have something blurry if you want to count it.

A "very rough estimate" seems possible. In fact, we can arrive at that intuitively:
Quote
nouns are a very large class; adverbs, verbs, and prepositions are large classes; adjectives are a big class; and pronouns, exclamations, determiners and conjunctions are minuscule classes.
Yes, certainly. That's fairly well known. I don't think you'd find many people who disagree with you about it.

Quote
At least etymologically speaking, it should be possible to trace a word back to its original meaning, and thereby rank the word classes correctly after size – or can it? What's your opinion?
This is confusing. Why are you mixing word class and etymology? There's no relationship there, although you could attempt to statistically analyze how often (and which) word classes change from one to another.
Word classes are properties of lexical items in a specific grammar at one time, not throughout history or space (where they might or might not change from generation to generation or person to person).

Quote
Would you care to make a guesstimate as to how many of the 600.000 words pertain to each word class in the dictionary?
It depends on how precise you want to be. Ranking the classes is relatively easy. One quick way of doing this is to use a corpus. Google Ngrams is convenient:
https://books.google.com/ngrams/graph?content=_NOUN_%2C_VERB_%2C_ADJ_
Very clearly, there are more nouns than verbs, and more of both than adjectives. Add in the other parts of speech if you want.
Obviously that graph measures usage, counting every repetition of the same word, but at least in ranking it is broadly similar to the distribution in the lexicon. You could easily (though probably not through a web interface) look for only the unique items in the corpus and count how many there are.
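To make the difference between counting usage and counting unique items concrete, here is a minimal sketch. The mini-corpus and its tags are made up for illustration (they are not Ngrams data), and it assumes the words are already lowercased and tagged.

```python
from collections import Counter

# Hypothetical hand-tagged mini-corpus: (word, part-of-speech) pairs.
tagged = [
    ("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"), ("the", "DET"),
    ("cat", "NOUN"), ("and", "CONJ"), ("the", "DET"), ("cat", "NOUN"),
    ("chased", "VERB"), ("a", "DET"), ("small", "ADJ"), ("bird", "NOUN"),
]

def token_counts(pairs):
    """Count every occurrence (what a usage graph measures)."""
    return Counter(tag for _, tag in pairs)

def type_counts(pairs):
    """Count each distinct (word, tag) pair once (closer to lexicon size)."""
    return Counter(tag for _, tag in set(pairs))

tokens = token_counts(tagged)
types = type_counts(tagged)
for tag in sorted(tokens):
    print(f"{tag}: {tokens[tag]} tokens, {types[tag]} types")
```

In this toy corpus determiners tie with nouns on tokens (4 each) but fall behind on types (2 against 3), which is the kind of gap between usage and lexicon that repetition produces.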

If you want to count up everything in the OED, go ahead. It'll look a lot like your list of 15 words. You could, if you wish, take a sample of maybe 1000 words and see what the distribution is like. That'll be a rough approximation.
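A rough sketch of that sampling idea follows. The lexicon here is invented purely to exercise the function (the 60/25/15 split is not a claim about English), the function name `estimate_shares` is hypothetical, and each word is assumed to carry exactly one class label.

```python
import math
import random
from collections import Counter

def estimate_shares(lexicon, sample_size, seed=0):
    """Estimate each class's share from a simple random sample of headwords.

    lexicon: list of (word, word_class) pairs, one class per word.
    Returns {word_class: (share, standard_error)}.
    """
    rng = random.Random(seed)
    sample = rng.sample(lexicon, sample_size)
    counts = Counter(word_class for _, word_class in sample)
    result = {}
    for word_class, n in counts.items():
        p = n / sample_size
        # Standard error of a proportion (ignores the finite-population correction).
        se = math.sqrt(p * (1 - p) / sample_size)
        result[word_class] = (p, se)
    return result

# Made-up lexicon: 60% nouns, 25% adjectives, 15% verbs.
lexicon = (
    [(f"noun{i}", "noun") for i in range(6000)]
    + [(f"adj{i}", "adjective") for i in range(2500)]
    + [(f"verb{i}", "verb") for i in range(1500)]
)

shares = estimate_shares(lexicon, sample_size=1000)
for word_class, (p, se) in sorted(shares.items(), key=lambda kv: -kv[1][0]):
    print(f"{word_class}: {p:.1%} +/- {1.96 * se:.1%}")
```

With 1000 words the 95% margin is at most about 3 percentage points, which is fine for ranking the big classes. A very small class, such as conjunctions, may not show up in the sample at all, so the tiny classes are better counted directly.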

There may be a way to do this automatically-- the OED does have word classes listed, so you could just count those up on the computer if you can access the raw data properly. They might already have that info somewhere.

But then we circle back to the questions above:
1. Are the labels applied correctly in the dictionary (and could you do any better if not)?
2. What's the point in labeling these anyway?


I believe that anyone who would question whether a rough approximation is possible would do so because they believe that all of this (the definitions, the existence of classes, etc.) is uncertain.

A real problem with this in English is the process of "zero-derivation": Google is a noun that became a verb. So is it in one class or two? It's easy enough to count, but a lot harder to decide why, what and how to count.
Welcome to Linguist Forum! If you have any questions, please ask.

Offline freknu

  • Forum Regulars
  • Serious Linguist
  • *
  • Posts: 396
  • Country: fi
    • Ostrobothnian (Norse)
Re: Estimating word class size. Why cannot it be done?
« Reply #2 on: January 04, 2014, 10:28:02 PM »
I don't know any specifics, but there are two general rules of thumb that are probably close to the truth:

  • content words (generally productive) far outnumber function words (generally non-productive)
  • proper nouns (names) and common nouns ("things") outnumber other content words

However, this all depends on the language. Some languages do not have distinct nouns, some lack a contrast between noun and adjective, and so on.

If you had a database of words and their word class, you could just count the number of entries in each class.
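As a small illustration, the fifteen words from the first post can serve as such a database. The sketch below counts them under two conventions; treating the first-listed class as the "main" one is an arbitrary assumption made here, not something the original list claims.

```python
from collections import Counter

# The fifteen words from the first post, with the classes listed there.
words = {
    "actor": ["noun"],
    "scarecrow": ["noun"],
    "obey": ["verb"],
    "gather": ["verb", "noun"],
    "glamorous": ["adjective"],
    "helpful": ["adjective"],
    "better": ["verb", "adverb", "adjective", "noun"],
    "into": ["preposition"],
    "with": ["preposition"],
    "to": ["preposition", "adverb"],
    "yet": ["adverb", "conjunction"],
    "you": ["pronoun"],
    "their": ["determiner"],
    "underneath": ["preposition", "adverb"],
    "wow": ["exclamation", "verb", "noun"],
}

# Convention A: a word counts once for every class it is listed under.
every_label = Counter(c for classes in words.values() for c in classes)

# Convention B: a word counts only for its first-listed class.
first_label = Counter(classes[0] for classes in words.values())

for word_class, n in every_label.most_common():
    print(f"{word_class}: {n} (all labels), {first_label[word_class]} (first label only)")
```

Under convention A nouns lead with 5 entries. Under convention B prepositions lead with 4 and nouns drop to 2. The counting itself is trivial, but the convention chosen beforehand changes the ranking.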

Using a wordlist with 1448 words of basic non-compound core vocabulary (my native tongue), I get the following numbers:

229 adjectives
794 substantives
425 verbs

However, this list excludes many basic derivations and does not consider compound words at all. This is also a work-in-progress and is very "non-uniform".

It might also be a lot easier to determine the productivity of each class and thus indirectly approximate their relative size, rather than directly determining absolute size.

Crunching numbers can be informative and constructive, but only if you have a clear goal that can be expected to have a reasonable and relevant answer.

That is, does it matter which class is greater in number?
« Last Edit: January 04, 2014, 10:29:57 PM by freknu »

Offline Daniel

Re: Estimating word class size. Why cannot it be done?
« Reply #3 on: January 04, 2014, 10:44:12 PM »
Quote
If you had a database of words and their word class, you could just count the number of entries in each class.
Exactly. Counting is easy. Defining is hard.

Quote
That is, does it matter which class is greater in number?
Indeed. If not, if this is just out of curiosity, you just need to go through the OED (or another source) and count. Simple enough.


One more thing to add:
There might be a difference between the entire dictionary of English (e.g., the OED, a corpus, etc.) and what an individual speaker knows. Certainly the 20,000-40,000 words known by a native speaker are fewer than the dictionary contains overall, but I don't know if the relative rankings would change. I imagine that the rankings would be consistent in most cases, but that the relative sizes would vary: there might be 10x more nouns than verbs in the OED but only 4x more in your personal knowledge.

You should also look here:
http://en.wikipedia.org/wiki/Zipf%27s_law

In short, we probably know all of the conjunctions, but only a very tiny fraction of the nouns. Likewise, we may know most of the verbs (because there are overall fewer).

In English, I would guess the following order is reasonable:

Noun > Adj > Verb > Adv* > Prep > ProN > Det > Conj
(For the smaller classes it depends on what you group in there-- pronouns are larger if we include words like "whatever" and "everyone". Determiners are larger if we include quantifiers. I'm assuming both of those in the rankings above. There are also interjections if you consider those relevant-- they're probably between Prep & ProN-- "oh", "ooh", "ouch", "yes", "no", etc. Also the word "not" is probably in its own class, so that's at the bottom.)

(*Adverbs are about equal in size to adjectives if you include all -ly adverbs. I don't find that very interesting personally, though, just that we could make a new adverb "supercalifragilistically". In fact, if we do that, Adv might be slightly larger than Adj due to a few adverb forms that are not also adjectives, like "however".)
« Last Edit: January 04, 2014, 11:06:21 PM by djr33 »

Offline norweger

Re: Estimating word class size. Why cannot it be done?
« Reply #4 on: January 25, 2014, 08:30:11 PM »
Thanks a lot for the Google Ngrams link!

Quote
There are a lot of layers to this, and we should discuss some assumptions. For example, an obvious one is that there are exactly (or about) 8 word classes in English and that words "belong" to them.

Yes, the number of classes isn't static or unanimously agreed on. The current school grammar of my language (this changes over time) operates with 10 classes. The names of those, and the corresponding number of words, are:

Noun: 17 000
Adjective: 9000
Verb: 5300
Adverb: 1500
Preposition: 250
Interjection: 200
Pronoun: 50
Conjunction: 8
Subjunction: Some hundred words
Determiner: Some hundred words

(This is from a rather small list containing about a fifth of the dictionary words in my language. By the way, brilliant guessing when it comes to relative sizes of lexical classes in the English language.)
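For anyone who wants the figures above as shares instead of raw counts, a quick sketch. Subjunctions and determiners are left out because only "some hundred words" is given for them, so the percentages are slightly too high.

```python
# Counts as reported in the list above; subjunctions and determiners are
# excluded because no exact figure is given for them.
counts = {
    "noun": 17000,
    "adjective": 9000,
    "verb": 5300,
    "adverb": 1500,
    "preposition": 250,
    "interjection": 200,
    "pronoun": 50,
    "conjunction": 8,
}

total = sum(counts.values())
shares = {word_class: n / total for word_class, n in counts.items()}

for word_class, share in shares.items():
    print(f"{word_class:<13}{counts[word_class]:>6}  {share:6.2%}")

# Combined share of the four large classes.
open_classes = ("noun", "adjective", "verb", "adverb")
open_share = sum(shares[c] for c in open_classes)
print(f"four large classes together: {open_share:.1%}")
```

Nouns come out at about half of the counted words, and the four large classes together at roughly 98%, so the six small classes share the remaining couple of percent even after the uncounted subjunctions and determiners are added back.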

When you say that an assumption that's debated is whether a word belongs in a class, do you then only mean the four large classes, or do you also mean that this is up for debate when it comes to the small classes?

Quote
«At least etymologically speaking, it should be possible to trace a word back to its original meaning, and thereby rank the word classes correctly after size – or can it? What's your opinion?» This is confusing. Why are you mixing word class and etymology? There's no relationship there, although you could attempt to statistically analyze how often (and which) word classes change from one to another.

What I meant is that in English, a verb in the present tense and verbs in gerund form and present participle are spelled the same – and they act like verbs, nouns and adjectives/adverbs. Playing and building are such examples. In English this makes everything much more blurry. What do you see as the most prominent examples of words that do not exactly fit into a lexical class?

If we try to trace the word building back to its roots (what I meant by etymologically speaking – by the way, am I using the word etymologically wrong?), one can find out how the word came to be. For example, building is a form of build, and build was originally a verb.
« Last Edit: January 25, 2014, 08:46:40 PM by norweger »

Offline Daniel

Re: Estimating word class size. Why cannot it be done?
« Reply #5 on: January 25, 2014, 08:54:01 PM »
Quote

Noun: 17 000
Adjective: 9000
Verb: 5300
Adverb: 1500
Preposition: 250
Interjection: 200
Pronoun: 50
Conjunction: 8
Subjunction: Some hundred words
Determiner: Some hundred words
Doesn't look too surprising. Certainly you have some good ideas now about that.
Adjective and verb might switch places depending on the language and exactly what data set you are using. The rest is about what I'd expect. The function-word classes are all small, and it's really unclear exactly how many pronouns there are and how you'd count them, but regardless there aren't too many overall.

Quote
When you say that an assumption that's debated is whether a word belongs in a class, do you then only mean the four large classes, or do you also mean that this is up for debate when it comes to the small classes?
Everything is unclear. It's unclear what it means to "belong to a class" and even the meaning of "class".  In order to do anything productive with this, you must first make some assumptions, but those assumptions will determine the results. So again, the "why" question is critical here. What's the point of this? Does it matter if you choose one definition of "noun" or another?

Quote
What I meant is that in English, a verb in the present tense and verbs in gerund form and present participle are spelled the same – and they act like verbs, nouns and adjectives/adverbs. Playing and building are such examples. In English this makes everything much more blurry. What do you see as the most prominent examples of words that do not exactly fit into a lexical class?
Participles. Infinitives. Adverbs (-ly in English looks like an inflection for adjectives to me!). Nouns and verbs sometimes ("Google it!").
Then there are many problems with subclasses. Are transitive and intransitive verbs in the same class? Do verbs that allow flexible argument structures belong to two (sub)classes? Does "right" belong to N, V and Adj? Or just one of those?
In the end... it's all fuzzy.

You can certainly make some decisions and go with those (dictionaries do this for example!) but there's no fundamental "answer" that is clear to everyone. Again, we return to the motivation: what's the point of this? Are you happy with the Latin-inherited grammar system?

The "subjunction" category is new to me (and I actually study coordination and subordination). I'd be happy to believe you, but again that's a theoretical choice.


Quote
If we try to trace the word building back to its roots (what I meant by etymologically speaking – by the way, am I using the word etymologically wrong?), one can find out how the word came to be. For example, building is a form of build, and build was originally a verb.
What's the point? Do you care what words are now? Or where they came from? Those are completely different questions. If you meet a new person, do you ask where they live or where they were born? Different questions, for different reasons.

There was a discussion quite a while ago about how all words (in English at least) appear to be possible to trace back to a noun form, etymologically. If you go back far enough (theoretically) almost everything was a thing at some point, and usage led to changes in class. That's mildly interesting as an observation (about the evolution of language for example) but it doesn't seem even remotely helpful for what you're asking. Why does it matter where a word comes from? Does that change its class today? Again, that is another theoretical assumption. Personally I don't care much about where a word comes from when I'm trying to understand its usage today. If I want to understand the history, I do look at the etymology. (And yes, you're using etymology correctly!)

Offline Corybobory

  • Global Moderator
  • Linguist
  • *****
  • Posts: 138
  • Country: gb
    • English
    • Coryographies: Handmade Creations by Cory
Re: Estimating word class size. Why cannot it be done?
« Reply #6 on: January 26, 2014, 04:28:45 AM »
It would be really interesting to see a corpus of all the words one person uses in an entire year.  Has this sort of data ever been collected before?

From that you could see how many words from each class there are in their natural vocabulary - it might be interesting to compare that then with other speakers.

I guess the problem is that some word classes are productive and others aren't, and that some words belong to many word classes but are used in only certain contexts, which muddles the numbers. But if it were an individual's usage only, that might be an interesting thing to analyze?

The transcriptions would take ages though...
BA Linguistics, MSt Palaeoanthropology and Palaeolithic Archaeology, current PhD student (Archaeology, 1st year)
