Corpus of Contemporary American English (COCA)
Harvard Dataverse (Africa Rice Center, Bioversity International, CCAFS, CIAT, IFPRI, IRRI and WorldFish)
Field | Value |
Title | Corpus of Contemporary American English (COCA) |
Identifier | https://doi.org/10.7910/DVN/AMUDUW |
Creator | Davies, Mark |
Publisher | Harvard Dataverse |
Description | The largest structured corpus of American English, composed of more than 450 million words in 189,431 texts, including 20 million words for each year from 1990-2012. The corpus is divided roughly equally among spoken, fiction, popular magazine, newspaper, and academic texts. |
Subject | Arts and Humanities; Other; English language; Corpora (Linguistics); Computational linguistics |
Language | English |
Contributor | McNeill, Katherine |
Relation | Corpus of Historical American English (COHA) |
Type | linguistic corpora |
Source |
The corpus is composed of more than 450 million words in 189,431 texts, including 20 million words for each year from 1990-2012. Detailed information on sources is available at: http://corpus.byu.edu/coca/?f=texts_e. The main sources for each file type are as follows:
• Spoken (95 million words [95,385,672]): Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.). [See notes on the naturalness and authenticity of the language from these transcripts.]
• Fiction (90 million words [90,344,134]): Short stories and plays from literary magazines, children’s magazines, popular magazines, first chapters of first edition books 1990-present, and movie scripts.
• Popular Magazines (95 million words [95,564,706]): Nearly 100 different magazines, with a good mix (overall, and by year) between specific domains (news, health, home and gardening, women, financial, religion, sports, etc.). A few examples are Time, Men’s Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, and Sports Illustrated.
• Newspapers (92 million words [91,680,966]): Ten newspapers from across the US, including USA Today, New York Times, Atlanta Journal Constitution, and San Francisco Chronicle. In most cases, there is a good mix between different sections of the newspaper, such as local news, opinion, sports, and financial.
• Academic Journals (91 million words [91,044,778]): Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), K (education), T (technology), etc.), both overall and by number of words per year.
|
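The word counts in the Source field can be cross-checked against the "more than 450 million words" and "equally divided" statements in the record. The following is a minimal sketch, not part of the original record: the per-genre figures are copied from the Source field, while the names GENRE_WORDS, genre_shares, and NOMINAL_WORDS_PER_YEAR are hypothetical and introduced here only for illustration.

```python
# Exact per-genre word counts, copied from the Source field of this record.
GENRE_WORDS = {
    "Spoken": 95_385_672,
    "Fiction": 90_344_134,
    "Popular Magazines": 95_564_706,
    "Newspapers": 91_680_966,
    "Academic Journals": 91_044_778,
}

# Assumption for illustration: the record's "20 million words each year"
# is treated as a nominal yearly target for 1990 through 2012 inclusive.
NOMINAL_WORDS_PER_YEAR = 20_000_000
YEARS = range(1990, 2013)


def genre_shares(counts):
    """Return each genre's fraction of the total word count."""
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}


total_words = sum(GENRE_WORDS.values())
nominal_total = NOMINAL_WORDS_PER_YEAR * len(YEARS)
shares = genre_shares(GENRE_WORDS)

print(f"Total from genre counts: {total_words:,} words")
print(f"Nominal total ({len(YEARS)} years): {nominal_total:,} words")
for genre, share in shares.items():
    print(f"{genre:<18} {GENRE_WORDS[genre]:>11,}  {share:6.2%}")
```

Under these figures the five genres sum to 464,020,256 words, and each genre accounts for between 19% and 21% of the total, which is consistent with a roughly equal division rather than an exact one.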