Record Details

Replication Data for: Quantifying gender biases towards politicians on Reddit

Harvard Dataverse (Africa Rice Center, Bioversity International, CCAFS, CIAT, IFPRI, IRRI and WorldFish)

 
 
Field Value
 
Title Replication Data for: Quantifying gender biases towards politicians on Reddit
 
Identifier https://doi.org/10.7910/DVN/YWRXEP
 
Creator Marjanovic, Sara
Stanczak, Karolina
Augenstein, Isabelle
 
Publisher Harvard Dataverse
 
Description This dataset contains approximately 10 million comments posted on Reddit between July 2018 and December 2019 that mention a cis-male or cis-female politician. The comments were extracted from Pushshift's historical data dumps of Reddit comments (https://files.pushshift.io/reddit/comments/). We selected subreddits of political relevance and then isolated comments about politicians using a pre-trained named entity linker. These comments were then used to examine gender biases in comment content (e.g. sentiment and specific adjectives used) and structure (e.g. comment length). We present this dataset so that others can investigate political gender biases expressed on public fora.

The data are distributed as a .7z archive that decompresses into a 13 GB .csv file containing all comments used in our paper. The CSV contains the Reddit comment IDs, the comment texts (with the politician's name replaced by the token [NAME]), the Wikidata ID of the mentioned politician, the name used to refer to that politician, and further information about the politician linked to their Wikidata ID (e.g. gender, country of origin). All comments should be in English, as they were extracted from predominantly English-speaking communities.

You can read more about our comment collection methodology and our investigation of the observed gender biases in our preprint (https://arxiv.org/pdf/2112.12014).
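
Because the decompressed CSV is about 13 GB, loading it into memory at once may not be practical. The following is a minimal sketch of streaming it row by row with Python's standard csv module and computing the number of comments and the mean comment length per politician gender. The column names "gender" and "body" are assumptions for illustration only, since this record does not list the actual CSV header; check the first line of the real file and adjust them.

```python
import csv
import io
from collections import Counter

# Hypothetical column names: the record does not give the real CSV header,
# so inspect the first line of the actual file and change these to match.
GENDER_COLUMN = "gender"
TEXT_COLUMN = "body"


def summarize_comments(handle, gender_column=GENDER_COLUMN, text_column=TEXT_COLUMN):
    """Stream rows from an open CSV handle and return, per gender value,
    a tuple of (comment count, mean comment length in characters)."""
    counts = Counter()
    total_chars = Counter()
    # DictReader yields one row at a time, so memory use stays small
    # regardless of the file size.
    for row in csv.DictReader(handle):
        gender = row[gender_column]
        counts[gender] += 1
        total_chars[gender] += len(row[text_column])
    return {g: (counts[g], total_chars[g] / counts[g]) for g in counts}


# Tiny in-memory sample standing in for the real file (made-up rows).
sample = io.StringIO(
    "id,body,gender\n"
    'a1,"[NAME] gave a speech today",female\n'
    'a2,"I disagree with [NAME]",male\n'
    'a3,"[NAME] won",female\n'
)
summary = summarize_comments(sample)
print(summary)
```

To run this on the real data, pass a handle from open(path, newline="", encoding="utf-8"), where path points to the decompressed .csv file; opening with newline="" lets the csv module handle comments that contain line breaks inside quoted fields.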
 
Subject Computer and Information Science
Social Sciences
 
Contributor Marjanovic, Sara Vera