For many years Wikipedia’s editor gender gap has been widely discussed, but its content gender gap has received less attention. This summer at OpenSym ’16 we presented our work on developing Wikidata Human Gender Indicators (WHGI), which provides statistical insight into the composition of Wikipedia biographies through the use of Wikidata. WHGI has allowed us to research the character of the biography gender gap, which increasingly looks like the political biases of the real world, and to arm community editing groups with metrics about their work. For instance, we provide the data that allows WikiProject Women in Red to reflect that, “[…] in November 2014, just over 15% of the English Wikipedia’s biographies were about women. Since then, we have improved the situation slightly, bringing the figure up to 16.52%, as of 9 October 2016. But that means, according to WHGI, only 232,357 of our 1,406,482 biographies are about women.”

Homepage of the WHGI website. WHGI is a service that monitors the gender of biographies across all Wikipedias for community metrics and research purposes.


It wasn’t until the creation of Wikidata that we were able to measure Wikipedia’s content deeply and reliably. Created in 2012, Wikidata is the database in which Wikipedia editors can store machine-readable facts that can be shared across all language versions. That is, we now know which Wikipedia articles are about humans, what their gender might be, which countries they were born in, and which other Wikipedia languages have articles about them. With that information two possibilities emerged: we could (1) compare Wikipedia’s and Wikidata’s biases against real-world measures and (2) keep track of how the genders (and other descriptions) of biographies in Wikipedias are changing.

A heatmap of the percentage of female biographies by country of birth or citizenship, “all time” as of October 9th 2016. India is highlighted as it has been an area for growth on this measure (see language-changes image below).


The four exogenous indexes we used were: the traditional United Nations’ Gender Development Index (GDI), which considers disparity in income, education, and life expectancy; Social Watch’s Gender Equity Index (GEI), which tries to broaden the scope of the variables by not only incorporating education and economic participation, but also by stretching into economic and political empowerment; the Global Gender Gap Index (GGGI), which grows yet wider by covering all previous topics but with more detail; and, most recently, the Social Institutions and Gender Index (SIGI), which has attempted to capture disparity in norms, values and attitudes.

WHGI, when disaggregated by country, can be seen as another gender disparity indicator, like the United Nations’ Gender Development Index (GDI), which measures gender disparities by country (e.g. earned income and life expectancy). Since Wikidata lets us associate biographies with countries, we can match the format of ranking countries by their gender disparities, in this case by the percentage of female biographies. We correlated WHGI’s by-country indicator with the U.N.’s GDI and three other popular external indexes (see the table above). By finding which external indexes matched WHGI the most and the least, we were able to determine that inclusion in Wikipedia and Wikidata is more correlated with political power in society than with health and educational issues. Additionally, the strength of each of these correlations has been increasing since we first started measuring them on a weekly basis in 2014. This means that the gender disparities found in WHGI by country increasingly resemble these real-world gender disparities. To further verify our conclusion, we also showed that, once the occupations of the biographies are accounted for, WHGI is correlated with U.S. Bureau of Labor Statistics data. WHGI appears to be a robust measure of gender disparity to be added to the landscape of gender disparity measures.
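To make the by-country comparison concrete, here is a minimal sketch of how one might compute the percentage of female biographies per country and rank-correlate it with an external index. It is an illustration only: the post does not say which correlation coefficient was used, so Spearman rank correlation is an assumption, and the function names (`percent_female`, `ranks`, `spearman`) and all numbers are hypothetical and invented for this example.

```python
from math import sqrt


def percent_female(counts):
    """Map country -> percentage of its biographies labelled 'female'."""
    result = {}
    for country, by_gender in counts.items():
        total = sum(by_gender.values())
        if total:  # skip countries with no biographies
            result[country] = 100.0 * by_gender.get("female", 0) / total
    return result


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(indicator, index):
    """Spearman correlation over the countries present in both mappings."""
    shared = sorted(set(indicator) & set(index))
    x = [indicator[c] for c in shared]
    y = [index[c] for c in shared]
    return pearson(ranks(x), ranks(y))


# Invented toy data: NOT real WHGI counts or real index scores.
toy_counts = {
    "A": {"female": 10, "male": 90},
    "B": {"female": 20, "male": 80},
    "C": {"female": 30, "male": 70},
    "D": {"female": 15, "male": 85},
}
toy_index = {"A": 0.60, "B": 0.70, "C": 0.90, "D": 0.65}

rho = spearman(percent_female(toy_counts), toy_index)
print(f"rank correlation: {rho:.3f}")
```

Repeating such a calculation on each weekly snapshot would give a time series of correlation strengths, which is the kind of trend described above.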

Differences in total biographies and percentage of female biographies in each Wikipedia language between October 9th and 16th 2016. Maithili, a language spoken in Nepal and India, has consistently shown biography additions that are closing its biography gender gap.


The fact that exogenous validations show that Wikipedias’ gender disparities are increasingly reflecting the real world suggests that Wikipedia is “catching up” to real-world disparities. Further, we demonstrated that this catching-up is also correlated with the activity levels of women-focused editing initiatives. That is something that Women in Red and other gender-focused editors ought to celebrate. In fact, because WHGI uses Wikidata, which is Wikipedia-language agnostic, we are able to provide statistics for all 280+ Wikipedias. Encouragingly, we’re able to see efforts in Nepali, Maithili and other small languages highlighted through the tool (see figure above). For instance, in the past week Maithili added 40 biographies to their Wikipedia, 22 of which were about women.
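The week-over-week comparison can be sketched as a simple difference between two snapshots. This is only an illustration under stated assumptions: the function name `snapshot_change` and the dictionary layout are hypothetical, and the starting totals below are invented. Only the weekly additions (40 biographies, 22 about women) come from the text above.

```python
def snapshot_change(before, after):
    """Compare two snapshots given as {'total': int, 'female': int}."""
    added_total = after["total"] - before["total"]
    added_female = after["female"] - before["female"]
    pct_before = 100.0 * before["female"] / before["total"]
    pct_after = 100.0 * after["female"] / after["total"]
    return {
        "added_total": added_total,
        "added_female": added_female,
        # Share of women among the newly added biographies only.
        "pct_female_of_added": (
            100.0 * added_female / added_total if added_total else None
        ),
        # Change in the overall share, in percentage points.
        "pct_point_change": pct_after - pct_before,
    }


# Starting totals are invented for illustration; the additions
# (40 biographies, 22 about women) are the figures quoted in the post.
week_1 = {"total": 1000, "female": 100}
week_2 = {"total": 1040, "female": 122}

change = snapshot_change(week_1, week_2)
print(change)
```

Whenever the share of women among the additions exceeds the existing overall share, the overall percentage moves up, which is what closing the gap means in this measure.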
We hope that many researchers and groups are able to use WHGI in their work, as Women in Red have already been doing. Of course the data and paper are freely available at

Written by

Hello, I am Maximilian Klein, Computer Science PhD student, Feminist, Yogi, Fuzzy Logician, and Vegan.
