Linguistic Corpora

Linguistic corpora let users investigate language use across different types of sources and across time. Each corpus focuses on particular types of data and particular time periods; the key feature is that all the data is compiled in one place and is easy to search. Searches let the user explore rises and declines in usage, how usages change over time, and how different sources use words or phrases differently. Some corpora also allow searches for specific words in combination with word classes (e.g. COCA searches for because [*noun*]). Corpora are a valuable resource for anyone interested in language use and history.
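When comparing usage across time periods or genres, raw hit counts can mislead because the sections of a corpus differ in size. A common remedy is to normalize counts to occurrences per million words. The sketch below is a minimal illustration only: the function name, the counts, and the section sizes are all hypothetical and are not taken from any of the corpora listed on this page.

```python
def per_million(hits: int, corpus_size: int) -> float:
    """Return the number of occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits * 1_000_000 / corpus_size


# Hypothetical (hits, section size in words) pairs, for illustration only.
counts = {
    "1990s": (1200, 20_000_000),
    "2000s": (900, 30_000_000),
}

# Normalized frequencies make sections of different sizes comparable.
normalized = {
    period: per_million(hits, size) for period, (hits, size) in counts.items()
}

for period, freq in normalized.items():
    print(f"{period}: {freq:.1f} per million words")
```

Many corpus interfaces report a per-million figure alongside raw counts, so a calculation like this is mainly useful when working with downloaded data or comparing results across different corpora.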

*Image from https://www.channelone.com/blog_post/web-tools-for-studying-vocabulary-words/

BYU Corpora
Brigham Young University gives access to the corpora listed below:

  • Corpus of Contemporary American English (COCA): Possibly the most widely used corpus of American English, COCA has over 520 million words of text spanning from 1990 to 2015, divided by genre (spoken, fiction, popular magazines, newspapers, and academic texts).
  • British National Corpus: Large (over 100 million words) corpus of spoken and written modern English designed to represent as wide a range of modern British English as possible, including extracts from regional and national newspapers, specialist periodicals and journals, academic books and popular fiction, published and unpublished letters, school and university essays, and scripted and unscripted informal conversation.
  • Strathy Corpus (Corpus of Canadian English): A 50 million word collection of Canadian English from over 1100 texts sorted by genre (spoken, fiction, magazines, newspapers, and academic texts).
  • There are also corpora in Spanish and Portuguese.
  • Historical Corpora:
      • Corpus of Historical American English (COHA): One of the larger historical corpora of English, COHA contains over 400 million words of text spanning from the 1810s to the 2000s, organized by genre and decade.
      • TIME Magazine Corpus: Trace changes in American English through this 100 million word corpus of text taken from 275,000 TIME magazine articles spanning from 1923 to 2006.
      • Hansard Corpus: Collection of British Parliament speeches containing nearly every speech given from 1803 to 2005. Allows for semantic-based searches (see page for more details).
  • Web-Based Corpora:
      • NOW Corpus (News on the Web): A 4.3 billion word collection of web-based newspapers and magazines spanning from 2010 to the present (updated daily with 5-6 million words from the preceding day), allowing the user to search up-to-date language trends.
      • Global Web-Based English Corpus (GloWbE): One of the larger corpora of international English, GloWbE is a 1.9 billion word corpus with texts from twenty countries that allows online searches and downloads of full-text data.
      • Wikipedia Corpus: 1.9 billion words from the full text of Wikipedia (4.4 million articles). Allows searches by word, phrase, part of speech, and synonyms, with tools to locate collocates and create virtual corpora (see page for details) that can narrow searches to certain topics, grammatical constructions, and/or keywords.
      • CORE Corpus (Corpus of Online Registers of English): 50 million words of web-based texts categorized by register.

Google Ngrams: A feature that searches Google Books for instances of words and phrases. The link above is to the BYU version, which has a few more features and allows the user to search American English, British English, and Spanish. The original Google version is also available.

The Grammar Lab: Compilation of links to four corpora (1,000,000 Word Sample Corpora, Corpus of Presidential Speeches, MICUSP for AntConc and WordSmith, and Song Lyrics Data Tables), plus project guides and information on statistics specific to the use of linguistic corpora.

Michigan Corpus of Academic Spoken English (MICASE)
A searchable corpus of academic spoken English, which includes 152 transcripts totaling 1,848,364 words. This is an invaluable resource for examining patterns in spoken academic American English.

Michigan Corpus of Upper-Level Student Papers (MICUSP)
This resource contains student papers from 16 disciplines across 4 levels (senior, 1st year grad, 2nd year grad, 3rd year grad), 7 paper types (e.g. argumentative essay, creative writing, research paper), and 8 textual features (e.g. abstract, methodology, literature review).

Buckeye Corpus
Corpus of high-quality recordings of 40 speakers from Columbus, Ohio, conversing with an interviewer. The interviews are transcribed and phonetically labeled.

The Sociolinguistic Archive and Analysis Project (SLAAP)
A collection of interviews, audio files, and transcriptions compiled by North Carolina State University. Access is password-protected and can be requested by emailing the project coordinator (found by following the Personnel link on the left of the page).

Santa Barbara Corpus of Spoken American English (SBCSAE)
Corpus containing around 249,000 words with transcriptions, audio, and timestamps (sometimes for individual intonation units).