The OEC: Composition and structure
The Oxford English Corpus is based mainly on material collected from pages on the World Wide Web. Some printed texts, such as academic journals, have been used to supplement certain subject areas.
The extensive use of web pages enables us to build a corpus of unprecedented scale and variety. The Oxford English Corpus is intended to be as wide-ranging as possible in its representation of the English language. Development was planned to ensure a balanced range of material from different subject areas, regions of the world, and types of writing. Structuring a corpus in this way produces a panoramic view of language use in every area of human life.
The corpus is divided into 20 major subject areas, as shown below:
| Subject area | % of content in the corpus |
|---|---|
| Life and Leisure | 5% |
As quickly becomes clear from these statistics, News accounts for the largest percentage of corpus data.
Each main subject area is further divided into a series of more specific categories. For example, Sport is divided into about 40 individual sports including baseball, basketball, sailing, soccer, etc. This makes it possible to explore the language of a particular subject area, or to compare two subject areas, or to investigate how the behaviour of a word changes in different contexts.
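As a rough illustration of comparing how a word behaves in two subject areas, the sketch below counts a word per million tokens in each area. It is only a minimal example under stated assumptions: the toy sentences, the `(subject, category)` keys, and the function names `per_million` and `frequency_by_subject` are hypothetical and do not reflect how the Oxford English Corpus is actually stored or queried.

```python
from collections import Counter

def per_million(word, tokens):
    """Return occurrences of `word` per million tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] * 1_000_000 / len(tokens)

# Hypothetical toy data: tokenised text keyed by (subject area, category).
toy_corpus = {
    ("Sport", "soccer"): "the striker missed a pitch perfect cross".split(),
    ("Sport", "baseball"): "the pitch was low and the batter swung".split(),
    ("News", "politics"): "the minister made her pitch to voters".split(),
}

def frequency_by_subject(word, corpus):
    """Merge the categories of each subject area, then normalise the count."""
    merged = {}
    for (subject, _category), tokens in corpus.items():
        merged.setdefault(subject, []).extend(tokens)
    return {subject: per_million(word, toks) for subject, toks in merged.items()}

rates = frequency_by_subject("pitch", toy_corpus)
for subject, rate in sorted(rates.items()):
    print(f"{subject}: {rate:.0f} per million tokens")
```

Normalising to a per-million rate matters here because subject areas differ greatly in size, so raw counts would not be comparable.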
English around the world
The Oxford English Corpus is dominated by British and US English, which together make up 80% of all text in the corpus. The remaining 20% (over 400 million words) is made up of varieties of English from around the world: Australian, South African, Canadian, Caribbean, etc. It also includes material from regions such as India, Singapore, and Hong Kong, where English is often used as a second language. The geographical range of the corpus is crucial for building a detailed picture of English as a global language.
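The percentages above imply a lower bound on the overall size of the corpus. The short calculation below works this out from the two figures given in the text (20% and "over 400 million words"); the variable names are illustrative, and because the source says "over", the results are minimums rather than exact totals.

```python
# Figures taken from the text: the remaining 20% is "over 400 million words".
OTHER_SHARE_PERCENT = 20
OTHER_WORDS_MIN = 400_000_000
UK_US_SHARE_PERCENT = 80

# Integer arithmetic avoids floating-point rounding in the result.
total_words_min = OTHER_WORDS_MIN * 100 // OTHER_SHARE_PERCENT
uk_us_words_min = total_words_min * UK_US_SHARE_PERCENT // 100

print(f"Whole corpus: over {total_words_min:,} words")
print(f"British and US English: over {uk_us_words_min:,} words")
```

In other words, the corpus as a whole must contain over 2 billion words, of which over 1.6 billion are British or US English.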
Text types and register
Text type or register refers to the different levels of language that may be used in different contexts. For example, writing about soccer may range from the formal (official regulations) to the very informal (fans’ blogs or comments on online message boards). The Oxford English Corpus has been carefully composed to ensure that the full range of registers is represented. The following list shows some of the kinds of writing that it contains:
- academic papers
- technical manuals
- newspaper reports, columns, and opinion pieces
- corporate websites
- magazine articles
- novels and short stories
- underground and counterculture websites
- personal websites
- message board postings
Journals, newspapers, and magazines are valuable for building a picture of norms and standards in English usage. Personal websites, blogs, and message boards, on the other hand, allow us to examine non-standard language such as slang, regionalisms, and newly coined words or expressions. For dictionary editors providing guidance on standard English, these sources also provide a good way of tracking common errors in written English (e.g. spelling mistakes or meaning confusion), which can then be used for writing properly targeted extra usage notes. Of course, it’s quite likely that some of today’s ‘mistakes’ in informal contexts such as blogs or message boards will eventually lead to changes in standard usage. The range of text types used in the Oxford English Corpus allows us to identify very precisely how language develops and how standards shift.
The material in the Oxford English Corpus dates from the year 2000 onwards. New text is continuously collected, with a new batch added every few months. As the corpus continues to develop, it will be possible to trace language change over time: words becoming more or less common, features spreading from one region to another, and new meanings emerging.
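One simple way to picture a word becoming more or less common is to compute its rate per million tokens for each year of collected text. The sketch below does this on invented data; the `(year, tokens)` document format and the function name `yearly_rate` are assumptions made for illustration, not a description of the tools used with the Oxford English Corpus.

```python
from collections import defaultdict

def yearly_rate(word, documents):
    """Return {year: occurrences of `word` per million tokens} in year order."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, tokens in documents:
        totals[year] += len(tokens)
        hits[year] += tokens.count(word)
    # Skip years with no tokens to avoid dividing by zero.
    return {y: hits[y] * 1_000_000 / totals[y] for y in sorted(totals) if totals[y]}

# Hypothetical toy documents: (year of collection, tokenised text).
docs = [
    (2000, "i read a web log today".split()),
    (2005, "my blog has a new post".split()),
    (2005, "that blog is popular".split()),
]

trend = yearly_rate("blog", docs)
for year, rate in trend.items():
    print(year, f"{rate:.0f} per million tokens")
```

With real data, each batch added to the corpus would extend the series, so a rising or falling rate across years would indicate a word gaining or losing currency.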