Corpus of Historical American English (COHA)
Harvard Dataverse (Africa Rice Center, Bioversity International, CCAFS, CIAT, IFPRI, IRRI and WorldFish)
Field | Value |
Title |
Corpus of Historical American English (COHA)
|
|
Identifier |
https://doi.org/10.7910/DVN/8SRSYK
|
|
Creator |
Davies, Mark
|
|
Publisher |
Harvard Dataverse
|
|
Description |
The largest structured corpus of historical English. COHA contains more than 400 million words of American English text from 1810 to 2009.
|
|
Subject |
Arts and Humanities
Other; English language; Corpora (Linguistics); Computational linguistics |
|
Language |
English
|
|
Contributor |
McNeill, Katherine
|
|
Relation |
Corpus of Contemporary American English (COCA)
|
|
Type |
linguistic corpora
|
|
Source |
The corpus is composed of more than 400 million words of text in more than 100,000 individual texts. The major sources for each genre are as follows:
• Fiction: Project Gutenberg (1810-1930), Making of America (1810-1900), scanned books (1930-1990), movie and play scripts, COCA (1990-2010).
• Magazine: Making of America (1810-1900), scanned and PDF (1900-1990), COCA (1990-2010). Note: In each decade, the magazine texts are balanced across at least ten magazines (with equivalent sub-genres for the 1900s).
• Newspaper: PDF converted to TXT from at least five newspapers (1850-1980), COCA, etc. (1990-2010).
• Non-fiction: Project Gutenberg (1810-1900), www.archive.org (1810-1900), scanned books (1900-1990), COCA (1990-2010). Note: In each decade, the non-fiction is balanced across the Library of Congress classification system.
For more information, see: http://corpus.byu.edu/coha/?f=texts_e. |
|