Record Details

Corpus of Historical American English (COHA)

Harvard Dataverse (Africa Rice Center, Bioversity International, CCAFS, CIAT, IFPRI, IRRI and WorldFish)

Field Value
 
Title Corpus of Historical American English (COHA)
 
Identifier https://doi.org/10.7910/DVN/8SRSYK
 
Creator Davies, Mark
 
Publisher Harvard Dataverse
 
Description Largest structured corpus of historical English. COHA contains more than 400 million words of American English text from 1810 to 2009.
 
Subject Arts and Humanities
Other
English language
Corpora (Linguistics)
Computational linguistics
 
Language English
 
Contributor McNeill, Katherine
 
Relation Corpus of Contemporary American English (COCA)
 
Type linguistic corpora
 
Source The corpus is composed of more than 400 million words of text in more than 100,000 individual texts. The major sources for each genre are as follows:
• Fiction: Project Gutenberg (1810-1930), Making of America (1810-1900), scanned books (1930-1990), movie and play scripts, COCA (1990-2010).
• Magazine: Making of America (1810-1900), scanned and PDF (1900-1990), COCA (1990-2010). Note: In each decade, the magazines are balanced across at least ten magazines (with equivalent sub-genres for the 1900s).
• Newspaper: PDF-to-text conversions of at least five newspapers (1850-1980), COCA etc. (1990-2010).
• Non-fiction: Project Gutenberg (1810-1900), www.archive.org (1810-1900), scanned books (1900-1990), COCA (1990-2010). Note: In each decade, the non-fiction is balanced across the Library of Congress classification system.

For more information, see: http://corpus.byu.edu/coha/?f=texts_e.