Investigating the Gender Pronoun Gap in Wikipedia
In recent years there have been many studies investigating gender biases in the content and editorial process of Wikipedia. In addition to creating a distorted account of knowledge, biases in Wikipedia and similar corpora have especially harmful downstream effects because these corpora are often used in Artificial Intelligence and Machine Learning applications. As a result, many of the algorithms deployed in production “learn” the same biases inherent in the data they are trained on. It is therefore increasingly important to develop quantitative metrics to measure bias. In this study we propose a simple metric, the Gendered Pronoun Gap, that measures the ratio of the occurrences of the pronoun “he” to the occurrences of the pronoun “she.” We use this metric to investigate the distribution of the Gendered Pronoun Gap in two Wikipedia corpora prepared by Machine Learning companies for developing and benchmarking algorithms. Our results suggest that the way these datasets have been produced introduces different types of gender biases that can potentially distort the learning process for Machine Learning algorithms. We stress that a single metric is not sufficient to capture the rich nuances of bias; rather, we suggest that the Gendered Pronoun Gap can be used as one of many metrics.
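As an illustration of the metric described above, the following is a minimal sketch of how a he-to-she ratio could be computed for a single document. The function name `gendered_pronoun_gap`, the case-insensitive whole-word matching, and the handling of documents with no occurrences of “she” are assumptions made for this example; they are not taken from the study itself.

```python
import math
import re

# Whole-word, case-insensitive match for the two pronouns only.
# Assumption: other forms ("him", "her", "his", "hers") are not counted,
# since the metric is described in terms of "he" and "she" alone.
_PRONOUN_RE = re.compile(r"\b(he|she)\b", re.IGNORECASE)


def gendered_pronoun_gap(text: str) -> float:
    """Return count("he") / count("she") for one document.

    Assumed edge cases (hypothetical choices, not from the source):
    - no "she" but at least one "he": returns infinity
    - neither pronoun present: returns NaN (ratio undefined)
    """
    he = 0
    she = 0
    for match in _PRONOUN_RE.finditer(text):
        if match.group(1).lower() == "he":
            he += 1
        else:
            she += 1
    if she == 0:
        return math.inf if he > 0 else math.nan
    return he / she


if __name__ == "__main__":
    sample = "He said he would go. She agreed with the plan."
    print(gendered_pronoun_gap(sample))  # 2.0
```

Whole-word matching keeps words such as “the” or “sheet” from inflating the counts. Applying the function to every article in a corpus would give a distribution of ratios of the kind the study investigates.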