What can the Oxford English Corpus tell us about the English language?
How many words are there in English?
It is a question often asked, but not so easily answered. New words are constantly being invented, developed from existing words, or adopted from other languages. Most will be used rarely, or only by a small group of people. This means that an unlimited number of words may occur in speech and writing which will never be recorded in even the largest dictionary.
Furthermore, what exactly is a word? Clearly we should include single units such as cat and dog. But are the plurals cats and dogs separate words? Should we include compounds such as walking stick, which are made up of two existing words? There is an almost unlimited number of such two-word compounds, and they can’t all be included in a dictionary. Are contractions like can’t and won’t one word or two? And what about abbreviations like BBC and Dr, or proper names such as London, Nelson, and Harry Potter: are they words? As you can see, the question is not a straightforward one.
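To see how much the answer depends on the definition of a word, here is a minimal Python sketch that counts the same sentence under three different conventions. The sentence, the function names, and the regular expressions are illustrative assumptions, not the rules used to build the Oxford English Corpus.

```python
import re

# Invented example sentence (not taken from the corpus).
TEXT = "The cat and the cats can't find Dr Nelson's walking stick."

def whitespace_tokens(text):
    # Convention 1: anything separated by spaces is a word.
    return text.split()

def alpha_tokens(text):
    # Convention 2: only runs of letters count, so "can't" becomes "can" and "t".
    return re.findall(r"[A-Za-z]+", text)

def word_tokens(text):
    # Convention 3: letters with internal apostrophes, so "can't" stays whole.
    return re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)*", text)

def distinct_forms(tokens):
    # Distinct spellings, ignoring case; "cat" and "cats" still count separately.
    return {token.lower() for token in tokens}

if __name__ == "__main__":
    print(len(whitespace_tokens(TEXT)))            # running words, split on spaces
    print(len(alpha_tokens(TEXT)))                 # contractions split in two
    print(len(word_tokens(TEXT)))                  # contractions kept whole
    print(len(distinct_forms(word_tokens(TEXT))))  # distinct forms
```

None of these conventions treats walking stick as a single unit or decides whether Dr and Nelson are words at all. Those choices still have to be made by whoever is counting.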
How many words do we use?
Although it may be impossible to know the number of words in English, the Oxford English Corpus can help us assess the number of words in current use.
Instead of talking about words, it’s more useful in this context to talk about lemmas, a lemma being the base form of a word. For example, climbs, climbing, and climbed are all examples of the one lemma climb. Just ten different lemmas (the, be, to, of, and, a, in, that, have, and I) account for a remarkable 25% of all the words used in the Oxford English Corpus. If you were to read through the corpus, one word in four (ignoring proper names) would be an example of one of these ten lemmas. Similarly, the 100 most common lemmas account for 50% of the corpus, and the 1,000 most common lemmas account for 75%. But to account for 90% of the corpus you would need a vocabulary of 7,000 lemmas, and to get to 95% the figure would be around 50,000 lemmas.
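The coverage figures above come from ranking lemmas by frequency and adding up their share of all the running words. A rough Python sketch of that calculation follows. The tiny lemma table, the sample sentence, and the function names are all hypothetical. A real corpus would rely on a full lemmatizer and billions of words, so the numbers this toy example produces say nothing about the Oxford English Corpus itself.

```python
from collections import Counter

# Hypothetical hand-made lemma table; a real corpus needs a full lemmatizer.
LEMMA_OF = {
    "climbs": "climb", "climbing": "climb", "climbed": "climb",
    "is": "be", "was": "be",
}

# Invented sample text, already split into words.
SAMPLE = ("The climber climbs and the other climber climbed "
          "the wall that was the hardest").split()

def lemma_counts(tokens):
    # Map each word to its lemma (falling back to the lowercased word) and count.
    return Counter(LEMMA_OF.get(t.lower(), t.lower()) for t in tokens)

def coverage(counts, n):
    # Share of all running words accounted for by the n most frequent lemmas.
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(freq for _, freq in counts.most_common(n)) / total

def lemmas_needed(counts, target):
    # Smallest number of top-ranked lemmas whose combined share reaches target.
    total = sum(counts.values())
    running = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running / total >= target:
            return rank
    return len(counts)

if __name__ == "__main__":
    counts = lemma_counts(SAMPLE)
    print(counts.most_common(3))
    print(round(coverage(counts, 3), 3))
    print(lemmas_needed(counts, 0.5))
```

Run on a large corpus, the same two functions would produce the kind of figures quoted above: coverage(counts, 10) for the share taken by the ten commonest lemmas, and lemmas_needed(counts, 0.9) for the vocabulary size that reaches 90%.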
The remaining 5% of the corpus consists of a very large number of lemmas which occur rarely: words like moidore or parados, which may occur only once every several million words. Like all languages, English consists of a small number of very common words, a larger number of intermediate ones, and then an indefinitely long ‘tail’ of very rare terms.
|Vocabulary size (no. of lemmas)|% of content in OEC|Example lemmas|
|---|---|---|
|10|25%|the, of, and, to, that, have|
|100|50%|from, because, go, me, our, well, way|
|1,000|75%|girl, win, decide, huge, difficult, series|
|7,000|90%|tackle, peak, crude, purely, dude, modest|
|50,000|95%|saboteur, autocracy, calyx, conformist|
|>1,000,000|99%|laggardly, endobenthic, pomological|
The long tail means that to account for 99% of the Oxford English Corpus you would need a vocabulary of more than a million lemmas. This would include some words which may occur only once or twice in the whole corpus: highly technical terms like chondrogenesis or dicarboxylate, and one-off coinages like bootlickingly or unsurfworthy that people would probably understand but would be unlikely to use.
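The shape of the long tail can be measured by comparing how much of the vocabulary the rare lemmas make up with how little of the running text they account for. The sketch below assumes a table of lemma frequencies is already available. The counts in TOY_COUNTS are invented for illustration and are not Oxford English Corpus figures, and the cut-off of two occurrences simply mirrors the 'once or twice' mentioned above.

```python
from collections import Counter

# Invented frequencies for illustration only (not OEC data).
TOY_COUNTS = Counter({
    "the": 60, "of": 30, "tackle": 6,
    "calyx": 2, "bootlickingly": 1, "unsurfworthy": 1,
})

def tail_summary(counts, max_freq=2):
    # Lemmas seen at most max_freq times form the "tail" in this sketch.
    total = sum(counts.values())
    rare = [lemma for lemma, freq in counts.items() if freq <= max_freq]
    rare_tokens = sum(counts[lemma] for lemma in rare)
    return {
        "rare_lemmas": len(rare),
        "share_of_vocabulary": len(rare) / len(counts) if counts else 0.0,
        "share_of_tokens": rare_tokens / total if total else 0.0,
    }

if __name__ == "__main__":
    print(tail_summary(TOY_COUNTS))
```

In this toy table the rare lemmas are half of the vocabulary but only a few percent of the running words, which is the same imbalance, on a tiny scale, that the corpus figures describe.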
If we decide that around 90-95% of the corpus gives a reasonable idea of an average vocabulary, we are left with a figure somewhere in the range of 7,000-50,000 lemmas: say, 25,000. What does a vocabulary of this size represent? It represents the set of most significant words in English: those which occur reasonably frequently and which account for all but a small part of everything we may encounter in speech or writing. It includes all the words that we actively use in general everyday life.
It’s interesting to note that most reasonably sized dictionaries contain significantly more than 25,000 lemmas. The 11th edition of the Concise Oxford English Dictionary, for example, lists more than 75,000 single-word lemmas, which means that the majority of its entries must belong to the long tail of extremely rare words. This makes good sense: such terms occur very infrequently, but when they do they are likely to be crucial to what’s being said, and the reader might well want to look them up. The idea of a quantifiable vocabulary should be seen in this light: the words we ignore for the purposes of the exercise may be very rare, but in context they may be very important.
What is the commonest word in English?
Based on the evidence of the Oxford English Corpus, which currently contains over 2 billion words, the 100 commonest English words found in writing around the world are as follows:
It’s noticeable that many of the most frequently used words are short ones whose main purpose is to join other, longer words rather than determine the meaning of a sentence. These are known as ‘function words’. It could be said that it’s more interesting to explore the frequency of ‘content words’, as shown in the list below:
Frequency by part of speech
The commonest nouns are time, person, and year, followed by way and day (month is 40th). The majority of the top 25 nouns (15) are from Old English, and of the remainder, most came into medieval English from Old French, and before that from Latin. Many of these words are very common because they have more than one meaning: way and part, for example, are listed in the Concise OED as having 18 and 16 different meanings respectively. They often also form part of common phrases: some of the frequency of time, for example, comes from its use in adverbial phrases like on time, in time, last time, next time, this time, etc.
As you would expect, the commonest verbs express basic concepts. Strikingly, the 25 most frequent verbs are all one-syllable words; the first two-syllable verbs are become (26th) and include (27th). Of these 25, 20 are Old English words, and three more (get, seem, and want) entered English from Old Norse in the early medieval period. Only try and use came from Old French. It seems that English prefers terse, ancient words to describe actions or occurrences.
Again, most of the top adjectives are one-syllable words, and 17 out of 25 derive from Old English: only different, large, and important are from Latin. In terms of the words’ meanings, great is higher in the ranking than big, probably because of its informal sense ‘very good’. Little ranks surprisingly high at 7th, compared with small at 15th. Bad is unexpectedly low at 23rd: is this because we have such a large choice of synonyms available for expressing ‘bad things’?