Description |
This dataset contains the data and code necessary to replicate work in
the following paper:
Narayan, Sneha, Jake Orlowitz, Jonathan Morgan, Benjamin Mako Hill,
and Aaron Shaw. 2017. “The Wikipedia Adventure: Field Evaluation of
an Interactive Tutorial for New Users.” In Proceedings of the 20th
ACM Conference on Computer-Supported Cooperative Work & Social
Computing (CSCW '17). New York, New York: ACM Press.
http://dx.doi.org/10.1145/2998181.2998307
The published paper contains two studies. Study 1 is a descriptive
analysis of a survey of Wikipedia editors who played a gamified
tutorial. Study 2 is a field experiment that evaluated the same
tutorial. This dataset contains the data used in the field experiment
described in Study 2.
Description of Files
This dataset contains the following files beyond this README:
- twa.RData — An RData file that includes all variables used in Study
2.
- twa_analysis.R — A GNU R script that includes all the code used to
generate the tables and plots related to Study 2 in the paper.
The RData file contains one variable (d) which is an R dataframe
(i.e., table) that includes the following columns:
userid (integer): The unique numerical ID representing each user in
our sample. These are 8-digit integers and describe public
accounts on Wikipedia.
sample.date (date string): The day the user was recruited to the
study. Dates use the “YYYY-MM-DD” format. For invitees, this is the
date their invitation was sent. For users in the control group, this
is the date that they would have been invited to the study.
edits.all (integer): The total number of edits made by the user on
Wikipedia in the 180 days after they joined the study. Edits to a
user's user page, user talk page and subpages are ignored.
edits.ns0 (integer): The total number of edits made by the user to
article pages on Wikipedia in the 180 days after they joined the
study.
edits.talk (integer): The total number of edits made by the user to
talk pages on Wikipedia in the 180 days after they joined the
study. Edits to a user's user page, user talk page and subpages are
ignored.
treat (logical): TRUE if the user was invited, FALSE if the user was
in control group.
play (logical): TRUE if the user played the game. FALSE if the user
did not. All users in control are listed as FALSE because any user
who had not been invited to the game but played was removed.
twa.level (integer): Takes a value of 0 if the user has not played
the game. Ranges from 1 to 7 for those who did, indicating the
highest level they reached in the game.
quality.score (float): This is the average word persistence (over a
6 revision window) over all edits made by this userid.
Our measure of word persistence (persistent word revision per word)
is a measure of edit quality developed by Halfaker et al. that
tracks how long words in an edit persist after subsequent revisions
are made to the wiki-page. For more information on how word
persistence is calculated, see the following paper:
Halfaker, Aaron, Aniket Kittur, Robert Kraut, and John
Riedl. 2009. “A Jury of Your Peers: Quality, Experience and
Ownership in Wikipedia.” In Proceedings of the 5th International
Symposium on Wikis and Open Collaboration (OpenSym '09),
1–10. New York, New York: ACM
Press. doi:10.1145/1641309.1641332.
Or this page: https://meta.wikimedia.org/wiki/Research:Content_persistence
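The column descriptions above imply a few invariants that can be checked after loading the file. The following is a minimal sketch, not part of twa_analysis.R. It builds a small made-up data frame with the documented column names instead of reading the real data, and the helper check_twa is hypothetical. With the real file, the toy construction would be replaced by load("twa.RData"), which defines d.

```r
# Toy stand-in for the data frame `d` stored in twa.RData.
# All values are invented for illustration only.
d <- data.frame(
  userid        = c(10000001L, 10000002L, 10000003L, 10000004L),
  sample.date   = c("2014-01-05", "2014-01-05", "2014-01-06", "2014-01-06"),
  edits.all     = c(12L, 3L, 40L, 1L),
  edits.ns0     = c(10L, 3L, 25L, 1L),
  edits.talk    = c(1L, 0L, 5L, 0L),
  treat         = c(TRUE, TRUE, FALSE, FALSE),
  play          = c(TRUE, FALSE, FALSE, FALSE),
  twa.level     = c(7L, 0L, 0L, 0L),
  quality.score = c(0.8, 0.5, 1.2, 0.0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if the documented invariants hold.
check_twa <- function(d) {
  all(
    !d$play[!d$treat],            # nobody in the control group played
    d$twa.level[!d$play] == 0,    # non-players have level 0
    d$twa.level[d$play] >= 1,     # players reached level 1 to 7
    d$twa.level[d$play] <= 7,
    # sample.date parses as YYYY-MM-DD
    !is.na(as.Date(d$sample.date, format = "%Y-%m-%d"))
  )
}

check_twa(d)
```

A FALSE result would suggest the file was altered or loaded incorrectly, since the README states these properties hold for every row.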
How we created twa.RData
The file twa.RData combines datasets drawn from three places:
A dataset created by Wikimedia Foundation staff that tracked the
details of the experiment and how far people got in the game.
The variables userid, sample.date, treat, play, and twa.level were
all generated in a dataset created by WMF staff when The Wikipedia
Adventure was deployed. All users in the sample created their
accounts within 2 days before the date they were entered into the
study. None of them had received a Teahouse invitation, a Level 4
user warning, or been blocked from editing at the time that they
entered the study. Additionally, all users made at least one edit
after the day they were invited. Users were sorted randomly into
treatment and control groups, based on which they either received
or did not receive an invite to play The Wikipedia Adventure.
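The sampling criteria described above can be illustrated as a filter followed by random assignment. This is only a sketch under stated assumptions: the candidate table, its column names, and the 50/50 assignment probability are all hypothetical. The README does not specify the assignment ratio or the order in which the criteria were applied.

```r
# Toy candidate pool; every column name and value here is hypothetical.
candidates <- data.frame(
  userid            = 1:6,
  days.since.signup = c(0, 1, 2, 5, 1, 0),
  teahouse.invite   = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE),
  level4.warning    = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  blocked           = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE),
  edited.after      = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
)

# Keep accounts created within 2 days of sampling that have no Teahouse
# invitation, no Level 4 warning, no block, and at least one later edit.
eligible <- subset(
  candidates,
  days.since.signup <= 2 & !teahouse.invite & !level4.warning &
    !blocked & edited.after
)

# Assign each eligible user to treatment or control at random.
# The equal probability is an assumption, not taken from the paper.
set.seed(1)
eligible$treat <- sample(c(TRUE, FALSE), nrow(eligible), replace = TRUE)
```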
Edit and text persistence data drawn from public XML dumps created
on May 21st, 2015.
We used publicly available XML dumps to generate the outcome
variables, namely edits.all, edits.ns0, edits.talk and
quality.score. We first extracted all edits made by users in our
sample during the six-month period after they joined the study,
excluding edits made to user pages or user talk pages. We
parsed the XML dumps using the Python-based wikiq and
MediaWikiUtilities software online at:
We obtained the XML dumps from: https://dumps.wikimedia.org/enwiki/
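The edit counts described above could be computed along the following lines once revisions have been extracted from the dumps. This is a hedged sketch, not the authors' pipeline. The revs table and its date and namespace columns are hypothetical, and all values are invented. The namespace numbers follow MediaWiki defaults (0 for articles, 1 for Talk, 2 for User, 3 for User talk). Two further assumptions are made here: "talk pages" means every odd-numbered namespace other than 3, and the 180-day window includes the sample date itself.

```r
# Toy inputs; values are invented and column names other than userid and
# sample.date are hypothetical.
users <- data.frame(
  userid      = c(1L, 2L),
  sample.date = as.Date(c("2014-01-01", "2014-01-01"))
)
revs <- data.frame(
  userid    = c(1L, 1L, 1L, 1L, 2L, 2L),
  date      = as.Date(c("2014-01-02", "2014-02-01", "2014-03-01",
                        "2014-12-31", "2014-01-03", "2014-01-04")),
  namespace = c(0L, 1L, 2L, 0L, 3L, 0L)
)

# Attach each user's sample date and keep edits within 180 days of it.
r <- merge(revs, users, by = "userid")
days <- as.numeric(r$date - r$sample.date)
r <- r[days >= 0 & days <= 180, ]

# Drop the User (2) and User talk (3) namespaces.
r <- r[!(r$namespace %in% c(2L, 3L)), ]

# Count the remaining edits per user that satisfy a condition.
count_by_user <- function(keep) {
  sapply(users$userid, function(u) sum(r$userid == u & keep))
}
counts <- data.frame(
  userid     = users$userid,
  edits.all  = count_by_user(TRUE),
  edits.ns0  = count_by_user(r$namespace == 0L),
  edits.talk = count_by_user(r$namespace %% 2L == 1L)
)
```

This simplified version drops the User and User talk namespaces entirely, whereas the column descriptions refer to a user's own pages and subpages.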
A list of edits made by users in our study that were subsequently
deleted, created on August 3rd, 2015.
The WMF staff created a dataset that listed all the edits made by
users in our study that were deleted before August 3rd, 2015. We
made the decision to include these edits in our counts, so as to
measure the total level of participation undertaken by each
editor. If a user in our study made article or talk page edits that
were subsequently deleted, we used the deleted edit logs to
identify them and incremented the variables edits.all, edits.ns0,
and edits.talk as appropriate. All edits drawn from the deleted edit
logs were assigned an edit persistence score of 0, since they were
deleted from Wikipedia.
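The handling of deleted edits can be illustrated for a single user. This is a minimal sketch with invented numbers, and the helper add_deleted is hypothetical. It assumes that quality.score is a plain mean over per-edit persistence values, with each deleted edit contributing a 0. That reading follows from the description above but is not confirmed by the analysis code.

```r
# Hypothetical helper: combine live and deleted edits for one user.
add_deleted <- function(live.counts, deleted.counts, live.persistence) {
  n.deleted <- deleted.counts[["edits.all"]]
  list(
    # Deleted edits are added to each of the three counts.
    counts = live.counts + deleted.counts,
    # Deleted edits enter the average with a persistence score of 0.
    quality.score = mean(c(live.persistence, rep(0, n.deleted)))
  )
}

# Invented example: three live edits and one deleted article edit.
live    <- c(edits.all = 3L, edits.ns0 = 2L, edits.talk = 1L)
deleted <- c(edits.all = 1L, edits.ns0 = 1L, edits.talk = 0L)
out <- add_deleted(live, deleted, live.persistence = c(1.5, 0.5, 1.0))
```

Under this reading, including deleted edits raises a user's counts but lowers their average quality score.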
We “manually” merged these datasets together.
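The README does not describe the merge procedure further. One plausible way to combine per-user tables keyed on userid is base R's merge, sketched below with invented toy tables. The table names experiment and outcomes are hypothetical.

```r
# Toy stand-ins for two of the sources; all values are invented.
experiment <- data.frame(
  userid = c(3L, 1L, 2L),
  treat  = c(TRUE, TRUE, FALSE),
  play   = c(FALSE, TRUE, FALSE)
)
outcomes <- data.frame(
  userid    = c(1L, 2L, 3L),
  edits.all = c(5L, 2L, 1L)
)

# Inner join on userid; merge() sorts the result by the key.
d <- merge(experiment, outcomes, by = "userid")

# Guard against silently losing or duplicating users in the join.
stopifnot(nrow(d) == nrow(experiment), !anyDuplicated(d$userid))
```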
Contact Us
For more details about the dataset, please see our paper.
If you notice any bugs or issues with these data or code, please
contact Sneha Narayan (snehanarayan@u.northwestern.edu) or the
other authors of this paper.
|