For many years Wikipedia’s editor gender gap has been widely discussed, but its content gender gap has received less attention. This summer at OpenSym ’16 we presented our work on developing Wikidata Human Gender Indicators (WHGI), which uses Wikidata to provide statistical insight into the composition of Wikipedia biographies. WHGI has allowed us to research the character of the biography gender gap, which increasingly resembles the political biases of the real world, and to arm community editing groups with metrics about their work. For instance, we provide the data that allows WikiProject Women in Red to reflect that, “[…] in November 2014, just over 15% of the English Wikipedia’s biographies were about women. Since then, we have improved the situation slightly, bringing the figure up to 16.52%, as of 9 October 2016. But that means, according to WHGI, only 232,357 of our 1,406,482 biographies are about women.”
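The percentage in the Women in Red quote can be checked directly from the two counts it cites. The short Python sketch below does only that arithmetic; the variable names are ours and are not part of WHGI.

```python
# Counts quoted by Women in Red for English Wikipedia as of 9 October 2016.
women_biographies = 232_357
all_biographies = 1_406_482

# Share of biographies about women, as a percentage.
share_percent = 100 * women_biographies / all_biographies

print(f"{share_percent:.2f}% of biographies are about women")
```

Rounded to two decimal places, the result agrees with the 16.52% figure in the quote.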
It wasn’t until the creation of Wikidata that we were able to measure Wikipedia’s content deeply and reliably. Created in 2012, Wikidata is the database in which Wikipedia editors can store machine-readable facts that can be shared across all language versions. That is, we now know which Wikipedia articles are about humans, what their gender might be, which countries they were born in, and which other Wikipedia language editions have articles about them. With that information two possibilities emerged: we could (1) compare Wikipedia’s and Wikidata’s biases against real-world measures and (2) keep track of how the genders (and other descriptions) of biographies in Wikipedias are changing.
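To make the idea concrete, here is a minimal sketch of how machine-readable facts of this kind could be tallied per Wikipedia language edition. It is not the WHGI pipeline: the records, their field names, and the choice to count items without a recorded gender in the denominator are all assumptions made for illustration. The identifiers Q6581072 (female) and Q6581097 (male) are the Wikidata items commonly used as values of the "sex or gender" property.

```python
from collections import Counter, defaultdict

FEMALE = "Q6581072"  # Wikidata item for "female"
MALE = "Q6581097"    # Wikidata item for "male"

# Hypothetical, hand-written records standing in for Wikidata items about humans.
# "sitelinks" lists the Wikipedia editions that have an article on the person.
items = [
    {"id": "item-1", "gender": FEMALE, "sitelinks": ["enwiki", "maiwiki"]},
    {"id": "item-2", "gender": MALE, "sitelinks": ["enwiki"]},
    {"id": "item-3", "gender": FEMALE, "sitelinks": ["newiki"]},
    {"id": "item-4", "gender": None, "sitelinks": ["enwiki"]},
]

def gender_counts_by_wiki(records):
    """Count biographies per gender value for each Wikipedia edition."""
    counts = defaultdict(Counter)
    for record in records:
        label = record["gender"] or "unknown"
        for wiki in record["sitelinks"]:
            counts[wiki][label] += 1
    return counts

def female_share(counter):
    """Fraction of an edition's biographies recorded as female (assumed definition)."""
    total = sum(counter.values())
    return counter[FEMALE] / total if total else 0.0

counts = gender_counts_by_wiki(items)
for wiki in sorted(counts):
    print(wiki, dict(counts[wiki]), f"{female_share(counts[wiki]):.0%}")
```

Because one Wikidata item can link to articles in several languages, a single record contributes to the count of every edition that covers the person, which is what makes per-language statistics possible from one shared database.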
WHGI, when disaggregated by country, can be seen as another gender disparity indicator, like the United Nations’ Gender Development Index (GDI), which measures gender disparities by country (e.g. in earned income and life expectancy). Since Wikidata lets us associate biographies with countries, we can match the format of ranking countries by their gender disparities, in this case by the percentage of biographies that are about women. We correlated WHGI’s by-country indicator with the U.N.’s GDI and three other popular external indexes (see table right). By finding which external indexes matched WHGI the most and the least, we were able to determine that inclusion in Wikipedia and Wikidata is more strongly correlated with political power in society than with health and educational issues. Additionally, the strength of each of these correlations has been increasing since we first started measuring them on a weekly basis in 2014. This means that the gender disparities found in WHGI by country increasingly resemble these real-world gender disparities. To further verify our conclusion, we also showed that, once the occupations of the biographies are accounted for, WHGI is correlated with data from the U.S. Bureau of Labor Statistics. WHGI appears to be a robust measure of gender disparity that can be added to the landscape of gender disparity measures.
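As an illustration of what correlating two by-country indicators involves, the sketch below computes a Spearman rank correlation in plain Python. This is an assumption on our part for the sake of the example: this post does not state which correlation coefficient was used, and the country labels and values are invented, not WHGI or GDI data.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical values: percentage of female biographies and an external index score.
female_share = {"A": 12.0, "B": 18.5, "C": 15.2, "D": 22.1}
external_index = {"A": 0.61, "B": 0.70, "C": 0.74, "D": 0.80}

countries = sorted(female_share)
rho = spearman(
    [female_share[c] for c in countries],
    [external_index[c] for c in countries],
)
print(f"rank correlation: {rho:.2f}")
```

A rank-based coefficient is a natural fit when both indicators are used to rank countries, since it compares orderings rather than raw values. Repeating such a computation on each weekly snapshot would show whether the correlation is strengthening over time.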
The fact that exogenous validations show Wikipedias’ gender disparities increasingly reflecting the real world suggests that Wikipedia is “catching up” to real-world disparities. Further, we demonstrated that this catching-up is also correlated with the activity levels of women-focused editing initiatives. That is something that Women in Red and other gender-focused editors ought to celebrate. In fact, because WHGI utilizes Wikidata, which is Wikipedia-language agnostic, we are able to provide statistics for all 280+ Wikipedias. Encouragingly, we’re able to see efforts in Nepali, Maithili and other small languages highlighted through the tool (see figure above). For instance, in the past week Maithili added 40 biographies to their Wikipedia, 22 of which were about women.
We hope that many researchers and groups will be able to use WHGI in their work, as Women in Red have already been doing. The data and paper are freely available at http://whgi.wmflabs.org/.