The English version of Wikipedia contains over 6.5 million articles… but only 0.09% of them have received Wikipedia’s highest quality rating. In other words, there’s still a lot of work to be done.
But where to start?
A group of highly experienced Wikipedia editors tried to answer that question. Through extensive discussion and consensus-building, they manually compiled lists of Vital Articles (VA) that should be prioritized for improvement. We analyzed their discussions to try to identify the values they brought to the table in making those decisions. We found, among other things, a desire for Wikipedia to be “balanced”, including along gender lines. Wikipedia has long been criticized for its gender imbalance, so this was encouraging!
But how is this value reflected in the actual prioritization decisions in the lists of Vital Articles these editors developed?
Not so much.
Figure 4 from our paper shows what would happen if editors were to use Vital Articles to prioritize work on biographies: the proportion of highest quality biographies about women would decrease––from 15.4% to 14.7%. By contrast, using pageviews (which indicate reader interest) to prioritize work would result in an increase in the proportion to 21.4%.
In short, if you want more gender balance, just prioritize what readers happen to read––not what a devoted group of editors painstakingly curated over several years with gender balance as one of the goals in mind. So what gives? Are Wikipedians just pretending to care about gender balance?
As it happens, only 7.5% of VA’s participants self-identify as women. For reference, that figure was 12.9% on all of English Wikipedia at the time we collected our data. Prior work gives plenty of evidence to help explain why a heavily male-skewed group of editors might have failed to include enough articles about women despite good intentions. Some of the reasons are quite intuitive too; as one Wikipedian put it, “On one hand, I’m surprised [the Menstruation article] isn’t here, but then as one of the x-deficient 90% of editors, I wouldn’t have even thought to add it.”
The takeaway: when it comes to prioritizing content, skewed demographics might prevent the Wikipedia editor community from fully enacting its own values. However, this effect is not the same for all community values; we find that VA would actually be a great prioritization tool for increasing geographical parity on Wikipedia. As for why? We have some ideas…
But for more on that (and other cool findings from our work), you’ll have to check out our research paper on this topic––coming to CSCW 2022! You can find the arXiv preprint here.
I sat in a gray cubicle, next to a social worker deciding whether to investigate a young couple reported for allegedly neglecting their one-year-old child. The social worker read a report aloud from their computer screen: “A family member called yesterday and said they went to the house two days ago at 5pm and it was filthy, sink full of dishes, food on the floor, mom and dad are using cocaine, and they left their son unsupervised in the middle of the day. Their medical and criminal records show they had problems with drugs in the past. But, when we sent someone out to check it out, the house was clean, mom was one-year sober and staying home full-time, and dad was working. But, dad said he was using again recently.” The social worker scrolled down past the report and clicked a button; a screen popped up with “Allegheny Family Screening Tool” at the top and a bright red, yellow, and green thermometer in the middle. “The algorithm says it’s high risk.” The social worker decided to investigate the family.
Workers in Allegheny County’s Office of Children, Youth, and Families (CYF) have been making decisions about which families to investigate with the Allegheny Family Screening Tool (AFST), a machine learning algorithm which uses county data including demographics, criminal records, public medical records, and past CYF reports to try to predict which families will harm their children. These decisions are high-stakes: An unwarranted Child Protective Services (CPS) investigation can be intrusive and damaging to a family, as any parent of a trans child in Texas could tell you now. Investigations are also racially disparate: Over half of all Black children in the U.S. are subjected to a CPS investigation, twice the proportion for white children. One big reason why Allegheny County CYF started using the AFST in 2016 was to reduce racial biases. In our paper, How Child Welfare Workers Reduce Racial Disparities in Algorithmic Decisions, and its associated Extended Analysis, we find that the AFST gave more racially disparate recommendations than workers. In numbers, if the AFST had fully automated the decision-making process from August 2016 to May 2018, 68% of Black children and only 50% of white children would have been investigated, a disparity of 18 percentage points. The process isn’t fully automated though: the AFST gives workers a recommendation, and the workers make the final decision. Over that same time period, workers (using the algorithm) decided to investigate 50% of Black children and 43% of white children, a smaller disparity of 7 percentage points.
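To make the comparison concrete, here is a minimal sketch of the arithmetic behind the two disparity figures. The function and variable names are our own, and the rates are the rounded percentages quoted above, so the results are differences in percentage points rather than relative changes.

```python
def disparity_points(rate_black_pct: int, rate_white_pct: int) -> int:
    """Gap between two investigation rates, in percentage points."""
    return rate_black_pct - rate_white_pct


# Rounded rates quoted in the text (August 2016 to May 2018).
algorithm_only = disparity_points(68, 50)          # AFST recommendations alone
workers_with_algorithm = disparity_points(50, 43)  # workers' final decisions

print(f"Algorithm alone: {algorithm_only} percentage points")
print(f"Workers using the algorithm: {workers_with_algorithm} percentage points")
```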
This complicates the current narrative about racial biases and the AFST. A 2019 study found that the disparity between the proportions of Black and white children investigated by Allegheny County CYF fell from 9 percentage points before the use of the AFST to 7 after it. Based on this, CYF said that the AFST caused workers to make less racially disparate decisions. Following these early “successes,” CPS agencies across the U.S. have started using algorithms just like the AFST. But how does an algorithm giving more disparate recommendations cause workers to make less disparate decisions?
Last July, my co-authors and I visited workers who use the AFST to ask them this question. We showed them the figure above and explained how the algorithm gave more disparate recommendations and that they reduced those disparities in their final decisions. They weren’t surprised. Although the algorithm doesn’t use race as a variable, most workers thought the algorithm was racially biased because they thought it uses variables that are correlated with race. Based on their everyday interactions with the algorithm, workers thought it often scored people too high if they had a lot of “system involvement,” e.g. past CYF reports, criminal records, or public medical history. One worker said, “if you’re poor and you’re on welfare, you’re gonna score higher than a comparable family who has private insurance.” Workers thought this was related to race because Black families often have more system involvement than whites.
The primary way workers thought they reduced racial disparities in the AFST was by counteracting these patterns of over-scoring based on system involvement. A few workers we talked with said they made a conscious effort to reduce systemic racial disparities. Most, however, said reducing disparities was an unintentional side effect of making decisions holistically and contextually: Workers often looked at parents’ records to piece together the situation, rather than as an automatic strike against the family. For example, in the report I mentioned at the top of this article, the worker looked at criminal and medical records only to see if there was evidence that the parents abused drugs. The worker said, “somebody who was in prison 10 years ago has nothing to do with what’s going on today.” Whether they acted intentionally or not, workers were responsible for reducing racial disparities in the AFST.
We’re at a university where an employee receives 27 bulk emails per week from the organization (untargeted, unpersonalized emails sent to a large list of recipients), each containing over 8 messages on average. That means an employee receives over 250 unique pieces of content per week from central units (e.g. the president’s office, the provost’s office): not from their students or their peers, but from the communicators in central units. By inputting a mailing list and pressing a button, a communicator could send an email to over 20,000 employees (see figure 1).
Figure 1. We found that the burden of being aware was put collectively on recipients.
The current organizational bulk email system is not an effective system. For one, these bulk emails are not free. Imagine that each employee spends 2 minutes reading a bulk email: at average pay rates, that single email costs the university 20,000 * 2 min * 0.5 $/min = $20,000. But of course, this cost isn’t paid by the sender; it is absorbed by all the departments where staff work. So the sender thinks the message is free, yet each department and unit has its employees’ time taken away bit by bit by these messages.
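The back-of-the-envelope estimate above can be written as a short calculation. This is only a sketch: the function name is ours, and the inputs are the illustrative figures from the paragraph (20,000 recipients, 2 minutes each, $0.50 per minute), not measured values.

```python
def bulk_email_cost(recipients: int, minutes_per_reader: float,
                    dollars_per_minute: float) -> float:
    """Total staff time cost, in dollars, of one bulk email."""
    return recipients * minutes_per_reader * dollars_per_minute


cost = bulk_email_cost(recipients=20_000, minutes_per_reader=2,
                       dollars_per_minute=0.5)
print(f"${cost:,.0f}")  # prints $20,000
```

Changing any one input shows how quickly the cost scales: halving the reading time halves the cost, but so does halving the recipient list through better targeting.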
For another, these bulk emails are not being remembered. Through a survey with 11 real bulk messages sent in the last two weeks and 11 fake bulk messages, we found that recognition of the real messages was only about 22 percentage points higher than for messages that were never sent (on average, 38% of the real messages were recognized, while 16% of the fake messages were also claimed as “seen”).
So we carried out a study to examine current practices and experiences of the stakeholders of this system. We conducted artifact walkthroughs with six communicators and nine recipients within our university. We also interviewed two of the managers of those recipients. Specifically, in these artifact walkthroughs, the recipients walk the interviewer through the previous email messages they received; the communicators walk us through the previous email messages they sent out.
We found that:
First, recipients are burdened: they feel that these bulk messages shift the responsibility for staying aware onto them.
Second, naturally, stakeholders have different preferences. The leaders and managers think that employees should know what’s going on in the university – however, the employees feel that most of these messages are too high-level to be relevant.
On the other hand, communicators have to send these emails even when they know that their recipients will dislike them, because they face difficulties of their own.
First, they have clients (organizational leaders) who want everyone to know their messages.
What’s more, communicators have very limited tools for targeting or personalizing emails. For example, they can only target people by job code, but a general title like program associate tells you nothing about the content of an employee’s job.
Most important, the system appears to work because these emails have good open rates. Open rates are nearly the only metric in the current bulk email tech platform because they are easier to get than end-to-end metrics like recognition rate or reading time. However, most of our recipients read the first line, then close the email, simply because they can’t get enough information from the subject line. In other words, we should not confuse a message that people open with one that actually contains content they find useful.
Figure 2. Summary of our findings.
To summarize, none of the stakeholders has a global view of the system or sees the costs of the current bulk email system to the organization (see figure 2). We’re working on a follow-up project to explore possible solutions for improving this system.
“I enjoyed writing. Perhaps it was because I hardly heard the sound of my own voice. My written words were my voice, speaking, singing, … I was there on the page” – Jenny Moss
In difficult times, we have a natural desire to write expressively and hear the voice deep inside ourselves. Although previous studies have shown therapeutic effects of expressive writing, most of them studied the activity in controlled labs where the writing was guided by a researcher. We think that therapeutic expressive writing happens spontaneously in the real world as well. We focus on spontaneous expressive writing on CaringBridge, an online platform for people to write and share their own or their loved ones’ health journeys and get support. Our goal was to develop a computational model that infers whether a post contains expressive writing, in order to help people get more benefit from using online health communities like CaringBridge.
One major challenge we encountered in pursuing this goal is that there is no past data on therapeutic expressive writing in the wild. To address this challenge, we thought about how we could adapt expressive writing data that was collected in the lab. We looked at 47 past lab studies and what they could tell us about expressive writing. It turns out that the writing counted as “expressive” in these studies shared some common characteristics: it used emotion and cognitive words a lot more than the writing that was not “expressive”. We used a clever statistical model (more details in the paper) to look at each CaringBridge post and tell us how much it matched those characteristics. The research team also labeled 200 posts by hand to see how often our model would come to the same conclusion as the team about whether a blog post constituted “expressive writing”. We agreed about 67% of the time, so there’s obviously a lot of room for improvement (we assume that humans are generally right and that the algorithm needs to improve how well it recognizes these posts).
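To give a feel for the kind of signal involved, here is a minimal sketch that measures how often emotion and cognitive words appear in a post. It is not the statistical model from the paper: the two word lists are tiny, hypothetical stand-ins for a real lexicon, and the function name is our own.

```python
import re

# Hypothetical, deliberately tiny word lists for illustration only.
EMOTION_WORDS = {"afraid", "grateful", "happy", "hope", "sad", "scared", "worried"}
COGNITIVE_WORDS = {"because", "know", "realize", "reason", "think", "understand"}


def category_rates(text: str) -> dict:
    """Share of tokens in `text` that fall into each word category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {"emotion": 0.0, "cognitive": 0.0}
    emotion = sum(token in EMOTION_WORDS for token in tokens)
    cognitive = sum(token in COGNITIVE_WORDS for token in tokens)
    return {"emotion": emotion / len(tokens), "cognitive": cognitive / len(tokens)}


post = "I was scared at first, but now I understand the plan and I hope it works."
rates = category_rates(post)
print(rates)
```

A model along the lines described above would then compare rates like these against the levels reported for expressive and non-expressive writing in the lab studies.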
Despite the limitations of the model, it provides the first ever opportunity to understand how often expressive writing may occur in the wild. We applied our model to the dataset of 13 million CaringBridge journal posts and inferred 22% were expressive and 78% were not expressive. This provides evidence for spontaneous expressive writing in the wild.
To sum up, our paper has three contributions. First, it demonstrates a way to use aggregated empirical data. In cases where no data are available, we could use common characteristics reported in past studies to study the group we are interested in, as we did in the paper. Second, it provides a baseline model to infer expressive writing and to be improved upon. Future research could use more sophisticated features and models by constructing a gold standard dataset or transferring knowledge from a related task that has already been learned. Third, it identifies expressive writing as a potential measure for online health communities. How much an individual engages with spontaneous expressive writing not only reveals their current writing practices, but also reveals the difficult times they are going through. Online health communities can then target their messaging by sending emotional support to those in difficult times and providing writing tips to those who are less expressive so that people can gain the most benefits from their writing.
Wikipedia is the online encyclopedia that anyone can edit. However, you probably didn’t know that “bots” (software tools) also edit Wikipedia! Human editors (“Wikipedians”) work together with bots to keep Wikipedia up to date. However, bots’ edits can conflict with each other; some have even written about bot editing wars “raging on Wikipedia’s pages”! But is this true? If it were really happening, bot conflict would be a big deal: bots automatically enforce the encyclopedia’s rules, so identifying bot conflict can help Wikipedians refine editing processes. However, previous researchers have used overly simple approaches to quantify conflict, ignoring the context of specific bot edits. To understand what bots were actually doing, we conducted a qualitative analysis of the context in which bots make edits. We found no evidence of bot conflict, though we did find some malfunctioning bots.
We are Abby Newcomb (St. Olaf College) and Sokona Mangane (Bates College), participants in the University of Minnesota’s 2021 Computer Science REU. In this blog post, we’re going to talk about the editing patterns of four bots on Wikipedia: AvicBot, Cyberbot I, RonBot, and AnomieBOT. Using these bots as a case study, we will show that what may appear to be conflict is routine and expected when examined in context.
How do bots edit Wikipedia?
Bots are automated or semi-automated software agents programmed to carry out various tasks. According to the Wikipedia page Wikipedia:Bots, bots carry out tasks that are “repetitive and mundane” in order to maintain the encyclopedia. Bots adhere to the Wikipedia bot policy and are approved by human editors in the Bot Approvals Group before they are allowed to edit any Wikipedia pages. Most bots do not make edits on actual encyclopedia articles, but take care of housekeeping tasks necessary to keep the encyclopedia running.
Each time a user (human or bot) makes a change to a Wikipedia page, other users have the option to undo that change. An edit undoing or reversing the edits of another user, partially or completely, is called a revert. This type of interaction between users is interesting because the revert could indicate conflict: a disagreement over what’s included on a page. However, the original edit could have been a mistake, the original edit could be a temporary one meant to be reverted later, or Wikipedia practices could have changed and rendered the original edit unnecessary. Determining if a revert actually indicates conflict can be difficult [1].
Bot-bot reverts, in which a bot’s edit is reverted by another bot, are common on Wikipedia. A number of routine processes require bots to revert each other’s edits [2]. However, the research paper “Even Good Bots Fight” by Tsvetkova et al. considers reverts to be a strong indicator of conflict. Their study concluded that the many cases of bots reverting each other indicate Wikipedia’s lack of control over its bots, and it led to media coverage of raging bot wars. Geiger and Halfaker’s replication study critiqued the assumption that bot reverts necessarily indicate bot conflict. Through a mixed-methods approach, they argue that bot-bot reverts are primarily routine work, with the vast majority of bots acting as intended and in collaboration with each other. We build on their research by inspecting how four of the most prolific bots use reverts to interact with both bot and human editors. By studying these bots and their reverted edits, we also found that reverts didn’t indicate conflict, which underscores the importance of looking at reverts in context.
To conduct our analysis, we looked at edits made to English Wikipedia in the first three weeks of January 2019. We looked at a random sample of 10 edits for each bot, using our conclusions and other summary figures to choose samples for further analysis [3]. In our study, we considered a revert to be any edit that causes the page to be an exact match of a previous version of the page within 24 hours. Thus we only look at reverts that completely remove the content of the original edit. By this definition, multiple edits can be undone by a single revert; in fact, 71% of reverts undo multiple edits at once. We define a self-revert as a revert where the original edit and the reverting edit were made by the same user. In the next four sections, we’ll dive into four different bots as case studies to understand whether bots’ reverts indicate conflict.
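The revert definition above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our analysis code: the `Revision` class, the function names, and the example data are hypothetical; each revision is assumed to carry a checksum of the full page text; and we read “within 24 hours” as meaning the restored version is at most 24 hours older than the reverting edit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Revision:
    rev_id: int
    user: str
    timestamp: datetime
    content_hash: str  # checksum of the full page text after this edit


def find_reverted_edits(history, window=timedelta(hours=24)):
    """Return (undone, reverting) pairs for one page's history, oldest first.

    An edit is a revert if it restores the exact content of an earlier
    revision made within `window`; every edit in between is undone by it.
    """
    pairs = []
    for i, reverting in enumerate(history):
        # Walk backwards, skipping the immediately preceding revision so
        # that at least one intervening edit is actually undone.
        for j in range(i - 2, -1, -1):
            restored = history[j]
            if reverting.timestamp - restored.timestamp > window:
                break
            if restored.content_hash == reverting.content_hash:
                pairs.extend((undone, reverting) for undone in history[j + 1:i])
                break
    return pairs


def is_self_revert(undone, reverting):
    """A self-revert: the original and reverting edits share a user."""
    return undone.user == reverting.user


# Hypothetical history: the fourth edit restores the first edit's content.
t0 = datetime(2019, 1, 1, 12, 0)
history = [
    Revision(1, "ExampleBot", t0, "aaa"),
    Revision(2, "ExampleBot", t0 + timedelta(minutes=10), "bbb"),
    Revision(3, "ExampleHuman", t0 + timedelta(minutes=20), "ccc"),
    Revision(4, "ExampleBot", t0 + timedelta(minutes=45), "aaa"),
]
pairs = find_reverted_edits(history)
for undone, reverting in pairs:
    print(undone.rev_id, reverting.rev_id, is_self_revert(undone, reverting))
```

In this example a single revert undoes two edits, one of which counts as a self-revert, which is the same pattern as a bot periodically rewriting a page it maintains.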
One of the bots that we looked at extensively is AvicBot, which is among the top 5 self-reverters. AvicBot is run by the user Avicennasis and has been operating since 2011 [4]. The bot has not made an edit since June 26, 2020 and was officially marked inactive on April 20, 2021. AvicBot performed a total of 11 tasks, including but not limited to: maintaining interwiki links, fixing redirects, tagging certain pages, maintaining a list of certain categories and more. Based on our analysis, AvicBot appears to be primarily a “listifying” bot because it maintains several tracking categories [5].
AvicBot reverts itself while doing its routine listifying work. For example, when we look at this revision and its corresponding revert, the edit summary indicates that the revision is creating a list from a category of pages flagged for deletion. Another user was added to the page and 45 minutes later, after several intervening edits, the revision that reverted the original edit removed 4 entries, including the one added during the original edit. As we can see here, several of AvicBot’s edits were undone by a single revert. The bot is self-reverting here to periodically update the category it’s tracking!
As many of AvicBot’s tasks involve maintaining tracking categories, we can infer that most of AvicBot’s edits will probably look similar to the edits we looked at. All of AvicBot’s reverted edits are self-reverts, because it’s constantly updating several categories and reverting its own revisions. Thus we have concluded that the vast majority of AvicBot’s self-reverts are routine work and do not indicate conflict.
Cyberbot I has the highest percentage of reverted edits of the 4 bots, with 13% of its edits reverted. Cyberbot I has been running since April 2012 [6] and remains active as of July 2021. The bot is maintained and operated by Wikipedia user Cyberpower678 [7]. Based on our data, Cyberbot I’s primary task appears to be updating various tables of statistics to keep Wikipedians informed about activity on the encyclopedia [8].
97% of Cyberbot I’s reverted edits are self-reverts: why? To find out, we selected 20 self-reverted edits at random. 18 of those edits were on the Cyberpower678/Tally page, which Cyberbot I appears to use to keep track of the current number of votes in RfA and RfB discussions, though the purpose of the page is not clearly stated. The edits present as reverts because the bot seems to repeatedly delete its own content from the page, then add the same content again. The other 2 edits in the self-reverted sample were on the RfX Report page, where Cyberbot I also deletes its own content just to add it again seconds later in order to update the page. These edits are not problematic because the bot appears to be doing its job and functioning as intended. Not all of Cyberbot I’s reverted edits are due to self-reverts: 26 edits were reverted by humans. The vast majority of these edits did not seem problematic and were often related to Cyberbot I’s Sandbox cleaning task [9].
Out of all Cyberbot I’s edits, we found 7 edits that are possibly problematic, either as malfunctions of the bot or disagreements over what the bot should be doing. The first problematic edit occurred when Cyberbot I updated a user’s admin stats but the edit count went down instead of up, which is impossible. This edit appeared to go unnoticed for the 24 hours it was visible. Cyberbot I also added an Articles for Deletion template to an article whose discussion had already closed, which was reverted by a human an hour later. The bot also deleted all content on the Changing username/Simple page twice, which was reverted by a human in about 2 hours each time. In a problematic sequence of 6 edits on the Template:RfX tally page, human users attempted to change the page and were reverted by the bot each time; they reverted Cyberbot I in turn until a human moved the page to circumvent the bot. Overall, these represent the only instances of potential conflict we identified involving Cyberbot I; the vast majority of its edits are productive contributions.
RonBot was frequently reverted by humans in our sample, coming in 3rd place in the list of bots most reverted by humans with 429 edits being reverted. It was run by Ronhjones. Due to the passing of its operator, the bot was recently deactivated and retired. We wanted to understand why RonBot was reverted by humans so often.
When we looked at 20 random edits reverted by humans, we found that 90% of these edits appeared to be caused by a malfunction of the bot. It appears that RonBot was adding articles to the “American footballers with no declared position” maintenance category [10]. However, users noticed that most of these footballers already had a position category listed in their articles, so these edits were reverted. We can see from the figure below that RonBot made the most edits on January 7, 87% of which were reverted. Of those reverts, 20% were made by humans. Based on our qualitative analysis, it’s clear that RonBot was malfunctioning.
When a bot malfunctions, non-admin users can report it on this page. Users did report the bot’s malfunction, requesting that it be blocked or deactivated temporarily until the issue was resolved. About 2-3 days later, Ronhjones seemed to be working on repairing the bot. When we looked at another random sample of RonBot’s edits, the bot was removing this category [11] from multiple pages (a number of which had been added on January 7th or months before) and fixing edits made during its malfunction, on January 8th, 16th, and 19th. As we can see on the graph, after January 7th, these three dates have the highest number of total edits per day. These edits weren’t considered self-reverts, as one can see from the graph, because RonBot reverted them after more than 24 hours. In this special case, reverts were crucial to identifying this malfunctioning bot, and indicated malfunction rather than conflict. We suggest that a clearer reporting mechanism may be useful for improving bot governance.
AnomieBOT [12] ranked 3rd on the list of bots frequently reverted by humans, so for this bot we primarily looked at article edits reverted by a human in order to understand the cause of these reverts. AnomieBOT has been editing Wikipedia since 2008 and is still active as of July 2021, operated by the user Anomie.
In a sample of 20 edits in the article namespace that were reverted by humans, we came across examples of 3 distinct tasks, as defined by the bot’s edit summary: dating maintenance tags [13], rescuing orphaned references [14], and fixing reference errors [15]. None of these tasks are controversial in any way, since they are all routine maintenance. So why are these edits being reverted?
We consider at least 75% of these reverts to be caused by human conflict or human error. The sequence of events often starts with a human making an edit that others don’t like, perhaps adding incorrect or poorly formatted information. The bot then does its job and tries to help this first human by fixing some of the reference errors or adding a date to a maintenance tag. Later, a second human editor comes along and sees the mess created by the first human, and chooses to revert that human’s edit along with AnomieBOT’s edit. The fault lies with the first human: the bot’s edit is irrelevant to the human’s decision to revert. Thus, AnomieBOT is frequently reverted along with other edits made by humans because a human made a controversial edit and AnomieBOT was just caught in the crossfire of a human disagreement.
In the article namespace, 93% of AnomieBOT’s edits reverted by humans were reverted at the same time as a human edit. Because of this statistic and our qualitative observations, we believe that AnomieBOT is frequently reverted not because of conflict between the bot and humans, but because of conflict between humans and other humans. AnomieBOT is not in conflict with any other users.
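A statistic like this one can be computed once reverted edits are grouped by the revert that undid them. The sketch below is illustrative only: the data layout, the function name, and the example users are hypothetical, and it assumes each reverted edit is already labeled as made by a bot or a human.

```python
def share_reverted_with_human(revert_groups, bot_name):
    """Fraction of `bot_name`'s reverted edits undone alongside a human edit.

    `revert_groups` is a list of groups; each group holds the edits undone
    by one revert, as dicts with "user" and "is_bot" keys.
    """
    bot_edits = 0
    with_human = 0
    for group in revert_groups:
        has_human_edit = any(not edit["is_bot"] for edit in group)
        for edit in group:
            if edit["user"] == bot_name:
                bot_edits += 1
                with_human += has_human_edit
    return with_human / bot_edits if bot_edits else 0.0


# Hypothetical groups of edits, each undone by a single human revert.
human = {"user": "ExampleHuman", "is_bot": False}
bot = {"user": "ExampleBot", "is_bot": True}
groups = [[human, bot], [bot], [human, human, bot], [bot, human]]

share = share_reverted_with_human(groups, "ExampleBot")
print(f"{share:.0%}")  # prints 75%
```

A high share suggests the bot’s edits were collateral in a revert aimed at a human edit, which still needs to be checked qualitatively, as we did above.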
In this blog post, we showed that reverts generally don’t indicate bot conflict. AvicBot and Cyberbot I both reveal that routine operation can involve self-reverting. RonBot was malfunctioning, which most people wouldn’t consider to be conflict. AnomieBOT reveals that just because a bot is being reverted doesn’t mean it’s involved in conflict; it may just be getting in the way of two human editors’ conflicts! Our research suggests that people attempting to quantify bot conflict need to develop more sophisticated methods than just counting reverts.
This research would not have been possible without the help of our mentors: Professor Loren Terveen and soon-to-be-PhD Zachary Levonian in the GroupLens Lab at the University of Minnesota. This work was also presented at the UMN Virtual Poster Symposium. Code for this work is available on GitHub.
[1] When reverting another user, most human editors will leave an edit summary to indicate why they made the change. An edit summary is a short explanation of an edit that is visible in the article’s edit history, shown in the top panel of the image below. Bots are also required by the bot policy to leave descriptive edit summaries, but their edit summaries are pre-programmed in their code. Thus if a bot is malfunctioning, its edit summary may not match what it’s actually doing.
[2] For example, DatBot reports users who break community guidelines. Meanwhile, HBC AIV helperbot5 checks if reported users have been blocked, and if they have, the bot removes the entry. Hence, reverting DatBot’s edits is a part of HBC AIV helperbot5’s job.
[3] For more information on our sample, we have provided summary information here.
[7] Cyberpower678 also operates a second bot, Cyberbot II. It appears that all of the code and tasks of Cyberbot I belonged to other bots previously, based on the bot’s approvals and user page, though the operator has rewritten and maintained the code.
[8] For example, Cyberbot I updates a separate Adminstats page for any user who requests these statistics. Other examples of tables maintained by the bot are the RfX Report page, which tracks any current discussions about requests for adminship or bureaucrat status, and the Requests for Unblock table, which keeps track of Wikipedians who would like to be unblocked from editing. In addition to maintaining these statistics pages, Cyberbot I clears various sandbox pages, which provide a space for users to experiment with editing tools without damaging Wikipedia articles; maintains several discussion pages, including Articles for Deletion and Changing username/Simple; and creates the current events page featured on the Wikipedia Main Page every day. Many more tasks are listed on Cyberbot I’s user page.
[9] The Sandbox functions as a sort of whiteboard, where Wikipedians can test out their editing skills as they wish. Later, a bot will come and wipe the whiteboard clean so that the next person arrives to a clear editing space. Some Wikipedians like to keep their content in the Sandbox for a while though, so they will revert Cyberbot I to restore their content to the Sandbox.
[10] A maintenance category is specifically used so that Wikipedia contributors know that a given article needs some kind of maintenance. These categories are not visible on the article page, but must be included in the source of each article. The “American footballers with no declared position” category is presumably used so that contributors will come to the article and add it to a position category, such as “Association football forwards.”
[11] The “American footballers with no declared position” category
[12] This bot operates out of 5 different accounts, each with various privileges and edit spaces, but our analysis was focused on the main AnomieBOT account.
[13] Also called maintenance templates, these tags allow editors to leave messages for others about problems with a given article. The tags can be dated so that editors know how long they have been on the page. An old tag may be deleted if it becomes out of date or no longer relevant to the article. AnomieBOT adds dates to these maintenance tags so that humans know when they were added to the article.
[14] The Wikitext language used to write Wikipedia articles requires the use of a reference template in order to cite information. References can be named so that they can be used multiple times in a document without having to copy the source information multiple times. An orphaned reference is a reference which has a name, but no accompanying reference information in the article. AnomieBOT attempts to recover information about the reference from the page history and add it to the article.
[15] Similar to orphaned references, reference errors are often caused by issues with the reference template required by Wikitext, the language used to write Wikipedia articles. When a human makes a reference error, AnomieBOT recognizes the error and attempts to fix it so that the article has fewer problems.