What is a corpus?

A corpus (plural: corpora) is a collection of written or spoken language data in a computer-readable format. It brings together large quantities of language evidence from a variety of real situations which lexicographers use to compile accurate and meaningful dictionary entries.

The evidence that we extract from corpora is at the heart of dictionary-making in Oxford in the 21st century, allowing us to track and record the very latest developments in language.

Specialized software analyses the corpora to show how words are related, to identify new and emerging words and senses, and to spot trends and patterns in usage, spelling, regional varieties, and more.
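One simple way to picture how software might flag an emerging word is to compare how often each word occurs, per million words, in an older and a newer sample of text. The sketch below is only an illustration under that assumption, not a description of the software used at Oxford: the function names, the thresholds (`min_ratio`, `min_count`), and the sample sentences are all invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def per_million(counts):
    """Convert raw counts to occurrences per million tokens."""
    total = sum(counts.values())
    return {word: n * 1_000_000 / total for word, n in counts.items()}


def emerging_words(older_texts, newer_texts, min_ratio=2.0, min_count=2):
    """Return words that are markedly more frequent in the newer sample.

    A word qualifies if it occurs at least min_count times in the newer
    sample and is either absent from the older sample or at least
    min_ratio times more frequent (per million words) than before.
    """
    old_counts = Counter(t for text in older_texts for t in tokenize(text))
    new_counts = Counter(t for text in newer_texts for t in tokenize(text))
    old_freq = per_million(old_counts)
    new_freq = per_million(new_counts)

    flagged = []
    for word, freq in new_freq.items():
        if new_counts[word] < min_count:
            continue  # too rare in the newer sample to count as evidence
        baseline = old_freq.get(word)
        if baseline is None or freq / baseline >= min_ratio:
            flagged.append(word)
    return sorted(flagged)


# Invented sample sentences, standing in for two time slices of a corpus.
older = ["She took a photo of the garden", "He sent a photo to the paper"]
newer = ["She took a selfie in the garden", "He posted a selfie and a photo"]

print(emerging_words(older, newer))
```

With samples this small the result means very little; real corpora run to billions of words, which is what makes frequency comparisons of this kind informative.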

A corpus enables lexicographers to examine a word in detail by showing the different contexts in which it occurs. For example, all the occurrences of a word can be grouped to reveal its most frequent usage patterns. A short extract from the Oxford English Corpus for the word sublime, for instance, shows the word alongside the text that surrounds each occurrence.
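Grouping occurrences by context is often done with a keyword-in-context (KWIC) listing, in which every occurrence of a word is lined up with the text on either side of it. The following is a minimal sketch of that idea, not the tool or data used for the Oxford English Corpus: the `concordance` function, its `width` parameter, and the sample sentences are assumptions made up for illustration.

```python
import re


def concordance(texts, keyword, width=30):
    """Return keyword-in-context lines for every occurrence of keyword.

    Lines are sorted by the text to the right of the keyword, so that
    occurrences followed by similar words end up next to each other.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    rows = []
    for text in texts:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()].rstrip()
            right = text[match.end():match.end() + width].lstrip()
            rows.append((left, match.group(0), right))
    rows.sort(key=lambda row: row[2].lower())
    # Right-align the left context so the keyword forms a single column.
    return [f"{left:>{width}} [{word}] {right}" for left, word, right in rows]


# Invented sample sentences, not taken from any real corpus.
samples = [
    "The view from the summit was truly sublime.",
    "A sublime performance by the orchestra.",
    "From the sublime to the ridiculous.",
]

for line in concordance(samples, "sublime"):
    print(line)
```

Sorting by the right-hand context is one of several possible choices; sorting by the word to the left instead would bring out typical modifiers such as "truly".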

The Oxford English Corpus

The Oxford English Corpus is one of many language corpora used by Oxford lexicographers. It contains more than 10 billion words of real 20th- and 21st-century English, mostly drawn from the web. It is one of the largest language corpora in the world and is growing at an average rate of 150 million words a month.

It represents all types of English, from research and specialist journals to newspapers and magazines, together with the language of blogs, emails, and social media. The material comes not only from the UK and the United States but from all parts of the English-speaking world, including Ireland, Australia, New Zealand, the Caribbean, Canada, India, Singapore, and South Africa.

Our lexicographers also draw on large corpora in Arabic, Portuguese, Spanish, and many other languages.

Keeping track of language change

At Oxford Dictionaries, our role is to monitor and record emerging vocabulary so that we can make new terms available to our dictionary users as soon as they start to gain traction. This involves identifying and tracking new words, capturing new word meanings, and updating existing dictionary entries with new evidence. Having so much language data available makes the work of a 21st-century lexicographer both exciting and challenging. It is made possible only by our unique combination of world-class technology and the expertise of our lexicographers.

