The OEC: Facts about the language


The 20-volume historical Oxford English Dictionary is the largest record of words used in English, past and present. It contains words that are now obsolete or rare (such as xenagogue 'a person who guides strangers' and vicine 'neighbouring or adjacent') in addition to the latest coinages such as phishing and podcast.

The second edition of the OED, published in 1989 and consisting of twenty volumes, contains more than 615,000 entries, and the third, available online, is expanding all the time, with batches of 2,500 new and revised words and phrases being added in regular quarterly updates.

How many words are there in English?

It is a question often asked, but not so easily answered. Even the OED does not set out to include every specialized technical term or slang or dialect expression ever used. New words are constantly being invented, developed from existing words, or adopted from other languages. Most will be used rarely, or only by a small group of people. This means that an unlimited number of words may occur in speech and writing which will never be recorded in even the largest dictionary.

Furthermore, what exactly is a word? Clearly we should include single units such as cat and dog. But are the plurals cats and dogs separate words? Should we include compounds such as walking stick, which are made up of two existing words? There are an almost unlimited number of such two-word compounds, which can't all be included in a dictionary. And what about abbreviations like BBC and Dr, or proper names such as London, Nelson, and Harry Potter: are they words? As you can see, the question is not a straightforward one.

How many words do we use?

Although it may be impossible to know the number of words in English, the Oxford English Corpus can help us assess the number of words in current use.

Instead of talking about words, it's more useful in this context to talk about lemmas, a lemma being the base form of a word. For example, climbs, climbing, and climbed are all examples of the one lemma climb. Just ten different lemmas (the, be, to, of, and, a, in, that, have, and I) account for a remarkable 25% of all the words used in the Oxford English Corpus. If you were to read through the corpus, one word in four (ignoring proper names) would be an example of one of these ten lemmas. Similarly, the 100 most common lemmas account for 50% of the corpus, and the 1,000 most common lemmas account for 75%. But to account for 90% of the corpus you would need a vocabulary of 7,000 lemmas, and to get to 95% the figure would be around 50,000 lemmas.

The remaining 5% of the corpus consists of a very large number of lemmas which occur rarely: words like moidore or parados, which may occur only once every several million words. Like all natural languages, English consists of a small number of very common words, a larger number of intermediate ones, and then an indefinitely long 'tail' of very rare terms.

Vocabulary size (no. of lemmas)   % of content in OEC   Example lemmas
10                                25%                   the, of, and, to, that, have
100                               50%                   from, because, go, me, our, well, way
1,000                             75%                   girl, win, decide, huge, difficult, series
7,000                             90%                   tackle, peak, crude, purely, dude, modest
50,000                            95%                   saboteur, autocracy, calyx, conformist
>1,000,000                        99%                   laggardly, endobenthic, pomological

The long tail means that to account for 99% of the Oxford English Corpus you would need a vocabulary of more than a million lemmas. This would include some words which may occur only once or twice in the whole corpus: highly technical terms like chondrogenesis or dicarboxylate, and one-off coinages like bootlickingly or unsurfworthy that people would probably understand but would be unlikely to use.
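The coverage figures discussed above can be illustrated with a small sketch. It groups word forms under their lemmas, then measures what share of a text the most frequent lemmas account for. The lemma table, the sample sentence, and the function names are all hypothetical and chosen for illustration only; they are not taken from the Oxford English Corpus or its tooling, and a real corpus would need a full lemmatizer rather than a hand-written lookup.

```python
from collections import Counter

# Hypothetical toy lemma table: maps inflected forms to their base form.
# A real corpus would rely on a full lemmatizer instead.
LEMMAS = {
    "climbs": "climb", "climbing": "climb", "climbed": "climb",
    "cats": "cat",
}

def lemmatize(token):
    """Return the lemma for a token, falling back to the lowercased token."""
    word = token.lower()
    return LEMMAS.get(word, word)

def lemma_counts(tokens):
    """Count how often each lemma occurs in a sequence of tokens."""
    return Counter(lemmatize(t) for t in tokens)

def coverage(counts, n):
    """Share of all tokens accounted for by the n most frequent lemmas."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(n)) / total

def vocab_needed(counts, target):
    """Smallest number of lemmas whose combined share reaches the target."""
    total = sum(counts.values())
    running = 0
    for size, (_, c) in enumerate(counts.most_common(), start=1):
        running += c
        if running / total >= target:
            return size
    return len(counts)

# Invented sample text, not corpus data.
tokens = "the cat climbs the tree and the cats climbed the wall".split()
counts = lemma_counts(tokens)

print(counts.most_common(3))
print(round(coverage(counts, 1), 2))   # share covered by the commonest lemma
print(vocab_needed(counts, 0.5))       # lemmas needed to cover half the text
```

Even in this tiny sample the pattern described above is visible: a single lemma (the) covers over a third of the tokens, while the last few percent of coverage require every remaining lemma.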

If we decide that around 90-95% of the corpus gives a reasonable idea of an average vocabulary, we are left with a figure somewhere in the range of 7,000-50,000 lemmas: say, 25,000. What does a vocabulary of this size represent? It represents the set of most significant words in English: those which occur reasonably frequently and which account for all but a small part of everything we may encounter in speech or writing. It includes all the words that we actively use in general everyday life.

It's interesting to note that most reasonably sized dictionaries contain significantly more than 25,000 lemmas. The 11th edition of the Concise Oxford English Dictionary, for example, lists more than 75,000 single-word lemmas, which means that the majority of its entries must belong to the long tail of extremely rare words. This makes good sense: such terms occur very infrequently, but when they do they are likely to be crucial to what's being said, and the reader might well want to look them up. The idea of a quantifiable vocabulary should be seen in this light: the words we ignore for the purposes of the exercise may be very rare, but in context they may be very important.

What is the commonest word?

Based on the evidence of the Oxford English Corpus, which currently contains over 2 billion words, the 100 commonest English words found in writing around the world are as follows:

1     the
2     be
3     to
4     of
5     and
6     a
7     in
8     that
9     have
10    I
11    it
12    for
13    not
14    on
15    with
16    he
17    as
18    you
19    do
20    at
21    this
22    but
23    his
24    by
25    from
 
26    they
27    we
28    say
29    her
30    she
31    or
32    an
33    will
34    my
35    one
36    all
37    would
38    there
39    their
40    what
41    so
42    up
43    out
44    if
45    about
46    who
47    get
48    which
49    go
50    me
 
51    when
52    make
53    can
54    like
55    time
56    no
57    just
58    him
59    know
60    take
61    people
62    into
63    year
64    your
65    good
66    some
67    could
68    them
69    see
70    other
71    than
72    then
73    now
74    look
75    only
 
76    come
77    its
78    over
79    think
80    also
81    back
82    after
83    use
84    two
85    how
86    our
87    work
88    first
89    well
90    way
91    even
92    new
93    want
94    because
95    any
96    these
97    give
98    day
99    most
100   us
 

It's noticeable that many of the most frequently used words are short ones whose main purpose is to join other, longer words rather than determine the meaning of a sentence. These are known as 'function words'. It could be said that it's more interesting to explore the frequency of 'content words', as shown in the list below:

Rank    Nouns         Verbs    Adjectives
1       time          be       good
2       person        have     new
3       year          do       first
4       way           say      last
5       day           get      long
6       thing         make     great
7       man           go       little
8       world         know     own
9       life          take     other
10      hand          see      old
11      part          come     right
12      child         think    big
13      eye           look     high
14      woman         want     different
15      place         give     small
16      work          use      large
17      week          find     next
18      case          tell     early
19      point         ask      young
20      government    work     important
21      company       seem     few
22      number        feel     public
23      group         try      bad
24      problem       leave    same
25      fact          call     able

Nouns

The commonest nouns are time, person, and year, followed by way and day (month is 40th). The majority of the top 25 nouns (15) are from Old English, and of the remainder, most came into medieval English from Old French, and before that from Latin. Notice that many of these words are very common because they have more than one meaning: way and part, for example, are listed in the Concise OED as having 18 and 16 different meanings respectively. They often also form part of common phrases: some of the frequency of time, for example, comes from its use in adverbial phrases like on time, in time, last time, next time, this time, etc.

Verbs

As you would expect, the commonest verbs express basic concepts. Strikingly, the 25 most frequent verbs are all one-syllable words; the first two-syllable verbs are become (26th) and include (27th). Of these 25, 20 are Old English words, and three more, get, seem, and want, entered English from Old Norse in the early medieval period. Only try and use came from Old French. It seems that English prefers terse, ancient words to describe actions or occurrences.

Adjectives

Again, most of the top adjectives are one-syllable words, and 17 out of 25 derive from Old English: only different, large, and important are from Latin. In terms of the words' meanings, great is higher in the ranking than big, probably because of its informal sense 'very good'. Little is surprisingly high at 7, as compared with small at 15. Bad is unexpectedly low at 23: is this because we have such a large choice of synonyms available for expressing 'bad things'?

