HATHI 1M: Introducing a Million Page Historical Prose Dataset in English from the Hathi Trust
Harvard Dataverse (Africa Rice Center, Bioversity International, CCAFS, CIAT, IFPRI, IRRI and WorldFish)
Field | Value
Title | HATHI 1M: Introducing a Million Page Historical Prose Dataset in English from the Hathi Trust
Identifier | https://doi.org/10.7910/DVN/HAKKUA
Creator | Bagga, Sunyam; Piper, Andrew
Publisher | Harvard Dataverse
Description | We release a new dataset of 1,660,401 randomly sampled pages of English-language prose from the Hathi Trust. The pages are roughly divided between fictional and non-fictional modes of writing and were published between 1800 and 2000. Our work focuses on the "page" as the basic bibliographic unit and, in contrast to prior work, employs a single predictive model for the whole historical period under consideration. Alongside publication metadata, we provide an enriched set of 109 features, including part-of-speech, sentiment scores, word "super-senses" and more. Our data is designed to give researchers in the digital humanities large yet portable random samples of historical writing across two foundational modes of English prose writing.
Subject | Arts and Humanities; Computer and Information Science; digital humanities; cultural heritage
Contributor | Bagga, Sunyam