Replication Data for: Quantifying gender biases towards politicians on Reddit
Harvard Dataverse (Africa Rice Center, Bioversity International, CCAFS, CIAT, IFPRI, IRRI and WorldFish)
Field | Value | |
Title |
Replication Data for: Quantifying gender biases towards politicians on Reddit
|
|
Identifier |
https://doi.org/10.7910/DVN/YWRXEP
|
|
Creator |
Marjanovic, Sara
Stanczak, Karolina
Augenstein, Isabelle |
|
Publisher |
Harvard Dataverse
|
|
Description |
This dataset contains ~10 million comments posted on Reddit between July 2018 and December 2019 that mention a cis-male or cis-female politician. The comments were extracted from Pushshift's historical data dumps of Reddit comments (https://files.pushshift.io/reddit/comments/). We selected subreddits of political relevance and then isolated comments about politicians using a pre-trained named entity linker. These comments were then used to examine gender biases in comment content (e.g. sentiment and specific adjectives used) and structure (e.g. comment length). We present this dataset so that others can investigate political gender biases expressed on public fora. The data are distributed as a .7z archive that decompresses into a 13 GB .csv file containing all comments used in our paper. The CSV contains the Reddit comment IDs, the comment texts (with the politician's name masked by the token [NAME]), the Wikidata ID of the mentioned politician, the name used to refer to the politician in question, and various information about the politician linked to their Wikidata ID (e.g. gender, country of origin, etc.). All comments should be in English, as they were extracted from predominantly English-speaking communities. More details on our comment-collection methodology and our investigation of the observed gender biases are available in our preprint (https://arxiv.org/pdf/2112.12014). |
|
Subject |
Computer and Information Science
Social Sciences |
|
Contributor |
Marjanovic, Sara Vera
|
|
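Because the description above mentions a 13 GB CSV, loading it into memory in one go may not be practical on a typical machine. The following is a minimal sketch of one way to stream such a file row by row with Python's standard library and compute a simple structural statistic (comment count and mean comment length per group). The column names `comment_id`, `body`, and `gender` are hypothetical placeholders, as are the helper names `summarize_by_group` and `summarize_csv`; the real header should be checked in the decompressed file. The sketch also assumes the .7z archive has already been extracted with an external tool.

```python
import csv
import io
from collections import defaultdict


def summarize_by_group(rows, group_col, text_col):
    """Consume an iterable of dict rows and return {group: (count, mean_text_length)}."""
    counts = defaultdict(int)
    total_len = defaultdict(int)
    for row in rows:
        group = row[group_col]
        counts[group] += 1
        total_len[group] += len(row[text_col])
    return {g: (counts[g], total_len[g] / counts[g]) for g in counts}


def summarize_csv(path, group_col, text_col):
    """Summarize a CSV file on disk, reading one row at a time."""
    # newline="" lets the csv module handle line breaks inside quoted comment texts.
    with open(path, newline="", encoding="utf-8") as handle:
        return summarize_by_group(csv.DictReader(handle), group_col, text_col)


# Tiny in-memory sample; the column names are assumptions, not the dataset's real header.
SAMPLE = io.StringIO(
    "comment_id,body,gender\n"
    "a1,[NAME] gave a speech,female\n"
    "a2,I like [NAME],male\n"
    "a3,[NAME] voted no,female\n"
)
summary = summarize_by_group(csv.DictReader(SAMPLE), "gender", "body")
print(summary)
```

For the full file, a call such as `summarize_csv("comments.csv", "gender", "body")` would apply the same logic, where `comments.csv` stands in for whatever name the extracted file actually has. Reading one row at a time keeps memory use roughly constant regardless of file size, at the cost of a single slow pass over the data.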