Using the Corpus
The Oxford English Corpus can be used in many different ways to study the English language and the cultures in which it is used. Because it is large, and because it is made up of text from many different subject areas and types of text, it acts as a representative slice of contemporary English from which all aspects of written language can be studied. In this section you can see some examples of different types of corpus analysis, particularly those relevant to dictionary writing.
Words don't exist in isolation. They have strong attractions for other words, and form patterns and associations that are often regular and predictable, though not usually rigid or permanent. These patterns form part of the innate knowledge of a native speaker of the language.
Understanding a word and its behaviour means looking at the other words, or collocates, with which it's typically found. Corpus analysis software, such as the Sketch Engine software used by Oxford Dictionaries (see www.sketchengine.co.uk), has revolutionized this kind of research because it can be used to build a detailed statistical profile of a word and its collocates in a matter of seconds, revealing typical usage and indicating the connotations that the word may carry.
Below you can see the collocational profile for the word eccentric in the Oxford English Corpus. The column headings describe the relationship of the words listed to the word in question, so that words listed in the first column as 'modifiers' are adverbs, as in 'slightly eccentric', 'somewhat eccentric', and so on, while words listed in the second column under 'modifies' are nouns modified by eccentric, as in 'eccentric millionaire' and 'eccentric character'. The third column lists adjectives which co-occur with eccentric.
What does this tell us about eccentric? We can spot a number of technical uses (orbit, contraction, femoral, axial), but if we leave these aside and focus on the main sense of the word, some characteristics emerge. Eccentric often occurs with adverbs like endearingly, charmingly, and delightfully, and with other adjectives like lovable and colourful: it appears to have positive connotations. Collocates like millionaire, billionaire, old, elderly, rich, wealthy suggest that we are most likely to use eccentric of elderly, wealthy people. Recluse, reclusive, loner, lonely (and perhaps bachelor) suggest solitary people. It's intriguing that the collocational profile includes both uncle and aunt: are aunts and uncles more likely to be eccentric than any other relatives? Finally, it appears that you are most likely to be described as eccentric if you are British or German.
Compare this with the word quirky. Although quirky has a similar meaning to eccentric, collocation reveals different patterns of use:
Whereas eccentric is associated with being elderly, rich, or reclusive, quirky is most strongly associated with being humorous or youthful: collocates include playful, cute, whimsical, funny, and adorable. Unlike eccentric, quirky is not typically used of people, but rather of their behaviour and characteristics (humour, smile, etc.). Quirky is also associated with art and creativity: songs, lyrics, films, and novels may be quirky, but very rarely eccentric.
Collocation patterns rarely indicate absolute 'rules': it wouldn't be wrong to use eccentric of a young person, or quirky with reference to an old person. But collocation does indicate the implicit connotations and attitudes that go along with the language we use, and which influence our choice of one word rather than another: it feels more natural to describe a rich old uncle as eccentric and to describe his young niece as quirky, rather than the other way round.
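A collocational profile of the kind discussed above can be roughly approximated in a few lines of Python. The sketch below is not the Sketch Engine method, which groups collocates by grammatical relationship ('modifiers', 'modifies', and so on); it simply counts words that fall within a small window of the word in question and scores them with logDice, one commonly used association measure. The function name collocates, the window parameter, and the toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter

def collocates(sentences, node, window=2):
    """Rank words found within `window` tokens of `node` by logDice score."""
    word_freq = Counter()   # frequency of every word in the corpus
    pair_freq = Counter()   # how often each word appears near the node word
    for sentence in sentences:
        tokens = sentence.lower().split()
        word_freq.update(tokens)
        for i, token in enumerate(tokens):
            if token != node:
                continue
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    pair_freq[tokens[j]] += 1
    scores = {}
    for word, f_xy in pair_freq.items():
        # logDice = 14 + log2(2 * f_xy / (f_x + f_y))
        scores[word] = 14 + math.log2(2 * f_xy / (word_freq[node] + word_freq[word]))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data (invented for this example, not taken from the corpus).
toy_corpus = [
    "the eccentric millionaire lived alone",
    "an eccentric millionaire bought a castle",
    "that eccentric uncle collected clocks",
    "the quirky film had offbeat humour",
]
top_word, top_score = collocates(toy_corpus, "eccentric")[0]
print(top_word, round(top_score, 2))
```

On this toy data millionaire ranks first for eccentric, because both of its occurrences fall next to the word. A real profile would need part-of-speech information to separate adverbs, nouns, and adjectives into columns, and far more text before the scores meant anything.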
The idea of one's 'inner child', popularized in psychotherapy in the 1980s, has spawned an array of humorous variations. These illustrate the way that language is routinely exploited and extended, not as part of a literary endeavour but simply as part of normal creativity in language use. In the Oxford English Corpus the most common of these are (in order):
- inner geek
- inner nerd
- inner diva
- inner dweeb
- inner slut
- inner cynic
- inner hippie
- inner brat
The corpus helps to identify the most productive ways in which new words and expressions are coined, and to rank the popularity of coinages. For example, the suffixes -fest, -speak, -tastic, and -ville are all highly prolific in English today, and their use can reveal some of the interests and concerns of our society:
The most common uses of -fest are: slugfest, lovefest, gabfest, crapfest, talkfest, gorefest, snoozefest, hatefest, bitchfest, snorefest, geekfest, bloodfest, blogfest, songfest, shitfest, screamfest, filmfest, yawnfest, funfest, sobfest, plugfest, mudfest, fragfest, and suckfest.
The most common uses of -speak are: management-speak, corporate-speak, marketing-speak, geek-speak, business-speak, therapy-speak, art-speak, lawyer-speak, media-speak, government-speak, consultant-speak, technospeak, adspeak, PR-speak, science-speak, politispeak, military-speak, computer-speak, BBC-speak, tech-speak, legal-speak, and left-speak.
The most common uses of -tastic are: craptastic, poptastic, funktastic, fabtastic, pimptastic, creeptastic, blingtastic, ego-tastic, retrotastic, geektastic, and blogtastic.
The most common uses of -ville are: dumpsville, dullsville, squaresville, hicksville, smallville, stupidville, and shitsville.
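Rankings like these come down to grouping words by suffix and counting them. The following is a minimal sketch of that idea, assuming the corpus is already available as a list of word tokens; the function name rank_coinages, the exclude set of established words, and the min_stem threshold are hypothetical and not part of any Oxford tool.

```python
from collections import Counter

SUFFIXES = ("fest", "speak", "tastic", "ville")

def rank_coinages(tokens, suffixes=SUFFIXES, exclude=frozenset(), min_stem=3):
    """Group tokens by suffix and list each group from most to least frequent."""
    counts = {suffix: Counter() for suffix in suffixes}
    for token in tokens:
        word = token.lower()
        if word in exclude:
            continue  # skip established words such as "manifest"
        for suffix in suffixes:
            # Require a few characters before the suffix to avoid bare matches.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                counts[suffix][word] += 1
                break
    return {suffix: [word for word, _ in counter.most_common()]
            for suffix, counter in counts.items()}

# Toy token list (invented for this example).
tokens = ["slugfest", "lovefest", "slugfest", "craptastic", "manifest",
          "dullsville", "geek-speak", "slugfest", "lovefest", "gabfest"]
print(rank_coinages(tokens, exclude={"manifest"})["fest"])
```

The exclude set matters in practice: a plain suffix match cannot tell a playful coinage from an ordinary word that happens to end the same way, so some list of established vocabulary is needed to filter the results.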
A number of common words in English started out as two-word phrases and eventually became fused as single-word forms: forever, somebody, everyone.
The Oxford English Corpus shows the process continuing today. The chart below gives some examples. For instance, it shows that the phrase some time now appears as the fused single-word form sometime in 32% of all occurrences in American English and 19% of all occurrences in British English.
The tendency to fuse fixed expressions is more common in American than British English. In American English someday has now become more or less standard, substantially outnumbering occurrences of some day; anymore and underway look set to follow. Although the same trend is apparent in British English, it tends to lag behind. Several further patterns are apparent:
- Fused forms almost always emerge first in informal English (the blog and message board parts of the corpus) and are much slower to spread to more formal, edited text such as newspapers and magazines; of the examples shown here, only someday is well represented across all text types.
- Fused forms seem to spread more easily if there is a direct analogy with an existing word: anymore benefits from the analogy with anyone and anybody, whereas ofcourse is almost non-existent because there are no comparable of- words.
- The tendency to fuse may be stronger when the phrase occurs at the end of a clause: 84% of instances of anymore occur at the end of a clause, compared with 46% of instances of any more.
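Percentages like those quoted above are the share of all occurrences of an expression that are written as a single word. The sketch below shows one simple way to compute that share from raw text; the function name fused_share is hypothetical, and the approach is an assumption rather than the method used for the corpus. In particular, it counts every instance of the two-word sequence, including uses such as any more cake where the fused form would not be an option.

```python
import re

def _count(phrase, text):
    """Count whole-word, case-insensitive occurrences of a phrase."""
    # Allow any run of whitespace between the words of a multi-word phrase.
    pattern = r"\s+".join(re.escape(part) for part in phrase.split())
    return len(re.findall(rf"\b{pattern}\b", text, flags=re.IGNORECASE))

def fused_share(text, fused, spaced):
    """Return the fraction of occurrences written as the fused form."""
    n_fused = _count(fused, text)
    n_spaced = _count(spaced, text)
    total = n_fused + n_spaced
    return n_fused / total if total else None

# Toy text (invented for this example).
sample = ("I don't go there anymore. She doesn't call any more. "
          "Nobody cares anymore.")
print(fused_share(sample, "anymore", "any more"))
```

Running the same function separately over the American and British parts of a corpus, or over its formal and informal parts, would give the kind of comparison described here.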
The Oxford English Corpus contains about 1,000 instances of could of and would of, as in I would of stopped her. About 850 of these occur in representations of direct speech (mostly from the Fiction domain, but also from interviews and courtroom transcripts). This leaves about 150 instances of could of and would of as a genuine written form, compared with 4 million instances of the standard English syntax would have and could have. However willing we may be to convert have to of in spoken English, the corpus shows that the habit has not spread into written English.
The most common animal word in the corpus is dog (the 997th most frequent word), followed (in order) by fish, horse, bird, cat, fox, chicken, mouse, cow, bull, lion, rat, tiger, pig, wolf, snake, and sheep. Analysis of animal words in the corpus is complicated: English uses animal words in a dazzling array of idioms and metaphors, often having nothing to do with actual animals. We can use the Oxford English Corpus to explore this rich figurative language.
Statistical analysis of similes involving animal words (in the pattern as ... as a cat/dog etc.) generates a detailed picture of the characteristics that English ascribes to animals:
- cat: nimble, curious, nervous, silent, comfortable, cool
- dog: sick, loyal, friendly
- horse: healthy, hungry
- bull: strong, mad, angry
- lion: brave, righteous, fierce, bold, protective, strong
- pig: happy, foul, drunk, sick
- fox: sly, smart
It is apparent that these characteristics are largely linguistic conventions and often have little to do with our understanding of real animals: horses are healthy, but dogs and pigs (and, according to footballers, parrots) are sick.
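The simile pattern as ... as a cat/dog lends itself to a simple search. The sketch below collects the adjectives found in that pattern for a given set of animal words and lists them by raw frequency; it is a simplified illustration, not the statistical analysis behind the lists above. The pattern SIMILE and the function name simile_profile are assumptions, and the regular expression only handles a single-word adjective followed by a or an.

```python
import re
from collections import Counter, defaultdict

# Matches "as <adjective> as a/an <noun>", e.g. "as sly as a fox".
SIMILE = re.compile(r"\bas\s+(\w+)\s+as\s+an?\s+(\w+)\b", re.IGNORECASE)

def simile_profile(text, animals):
    """Map each animal word to the adjectives used of it, most frequent first."""
    profile = defaultdict(Counter)
    for adjective, noun in SIMILE.findall(text):
        noun = noun.lower()
        if noun in animals:
            profile[noun][adjective.lower()] += 1
    return {animal: [adj for adj, _ in counts.most_common()]
            for animal, counts in profile.items()}

# Toy text (invented for this example).
sample_text = ("He was as sick as a dog. She is as sly as a fox, "
               "as sick as a dog, and as loyal as a dog. "
               "It was as light as a feather.")
print(simile_profile(sample_text, {"dog", "fox"}))
```

Similes about anything other than the listed animals, such as as light as a feather, are ignored, which keeps the profile focused on the animal words of interest.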