Record Details


Field	Value

Title	Corpus of Historical American English (COHA)

Identifier	https://doi.org/10.7910/DVN/8SRSYK

Creator	Davies, Mark

Publisher	Harvard Dataverse

Description	The largest structured corpus of historical English. COHA contains more than 400 million words of American English text from 1810 to 2009.

Subject	Arts and Humanities; Other; English language; Corpora (Linguistics); Computational linguistics

Language	English

Contributor	McNeill, Katherine

Relation	Corpus of Contemporary American English (COCA)

Type	linguistic corpora

Source	The corpus comprises more than 400 million words in more than 100,000 individual texts. The major sources for each genre are as follows: • Fiction: Project Gutenberg (1810-1930), Making of America (1810-1900), scanned books (1930-1990), movie and play scripts, COCA (1990-2010). • Magazine: Making of America (1810-1900), scanned and PDF (1900-1990), COCA (1990-2010). Note: In each decade, the magazine texts are balanced across at least ten magazines (with equivalent sub-genres for the 1900s). • Newspaper: PDF converted to TXT for at least five newspapers (1850-1980), COCA, etc. (1990-2010). • Non-fiction: Project Gutenberg (1810-1900), www.archive.org (1810-1900), scanned books (1900-1990), COCA (1990-2010). Note: In each decade, the non-fiction is balanced across the Library of Congress classification system. For more information, see: http://corpus.byu.edu/coha/?f=texts_e.

ICAR Research Data Repository for Knowledge Management