Analyzing Gender and Intersectionality in Machine Learning | Gendered Innovations

Machine learning (ML) technologies—including risk scoring, recommender systems, speech recognition and facial recognition—operate in societies alive with gender, race and other forms of structural discrimination. ML systems can play a part in reinforcing these structures in various ways, ranging from human bias embedded in training data to conscious or unconscious choices in algorithm design. Understanding how gender, sex, ethnicity and other social factors operate in algorithms in particular contexts can help researchers make conscious and informed decisions about how their work impacts society.

1. Analyze the social objectives and implications of the work
Teams need to work together to define the objectives of the system being developed or researched and to evaluate its potential social impacts, guided by the following questions:

a. How are the priorities and objectives of the technology set? Who will it benefit, and who will be left out or put at risk (Gurses et al., 2018)?
b. How do gender norms shape the research/product/service/system (Leavy, 2018)?
c. How might structural racism shape the work, e.g. with racial stereotypes being reproduced in search algorithms (Crenshaw, 1991; Noble, 2018)?
d. How do wealth disparities shape the work, e.g. will the system be deployed against or exclude those in poverty (Eubanks, 2018)?
e. How does environmental sustainability shape the work, e.g. where training a deep learning model has a large carbon footprint (Strubell et al., 2019)?

2. Analyze the data for potential biases
Each training set should be accompanied by information on how the data were collected and annotated. Data containing human information should include metadata summarizing statistics on factors such as participant gender, sex, ethnicity, age and geographic location (Gebru et al., 2018; Holland et al., 2018). Data labelling done via crowdsourcing platforms, such as Amazon Mechanical Turk, should include information about the crowd participants, along with the instructions given for labelling (Zou & Schiebinger, 2018).

a. Do the data sets used to train and test the model embed human bias? See, e.g., Buolamwini & Gebru (2018).
b. Analyze the contexts in which the data were generated: what are the known gender biases and norms that may influence the data-generating process? See, e.g., Shankar et al. (2017). Some features may have a different interpretation or impact when they are used by members of a self-defining social group to signal identification with that group (e.g. the deliberate socially-affirmative use of words that are considered offensive according to more generally prevailing social norms).
c. It is also possible to introduce bias during the data preparation stage, which involves selecting which attributes you want the algorithm to consider. First, consider whether it is appropriate for an algorithm to use gender (e.g. if it will be used to advertise clothing, which is often gender-specific) or not (e.g. if it will be used to advertise jobs, which need to be equally visible to all genders). If inappropriate, consider removing explicit gender attributes. In addition, consider whether any of the features selected for inclusion in the model are likely to be distributed unequally among different genders. For instance, if the model uses features that might correlate with gender, like occupation or ‘single parent’ status, it may still indirectly use gender.
d. It is also possible to embed bias in the objective functions and/or the heuristics guiding search-based algorithms. For instance, using health costs as a proxy for health needs in the formulation of an objective function could lead to racial bias in cases where patients from one racial group have less money allocated to them (Obermeyer et al., 2019).
e. Do the training data cover a diverse enough sample of the population to ensure that any models trained on them will perform well across populations? Are your data balanced for gender, ethnicity, geographic location, age and other social factors?
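
Two of the checks above can be sketched in a few lines of Python: summarizing how a data set is composed (the metadata statistics and question e), and looking for features that may act as proxies for gender (question c). This is a minimal sketch on invented toy records, not a prescribed procedure; the field names (`gender`, `single_parent`) and the helper functions `composition` and `rate_by_group` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy records; the field names are illustrative assumptions.
records = (
    [{"gender": "woman", "single_parent": True},
     {"gender": "woman", "single_parent": False}]
    + [{"gender": "man", "single_parent": False} for _ in range(6)]
)


def composition(rows, attribute):
    """Return the share of rows that carry each value of `attribute`."""
    counts = Counter(row[attribute] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def rate_by_group(rows, group_attr, feature):
    """Return, for each group, the fraction of rows where `feature` is truthy."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        totals[row[group_attr]] += 1
        positives[row[group_attr]] += int(bool(row[feature]))
    return {group: positives[group] / totals[group] for group in totals}


# Composition: is any group under-represented in the data?
gender_shares = composition(records, "gender")

# Proxy check: does a candidate feature occur at very different rates by gender?
single_parent_rates = rate_by_group(records, "gender", "single_parent")

print(gender_shares)
print(single_parent_rates)
```

A large difference in the per-group rates suggests that the feature could let a model use gender indirectly even after the explicit gender attribute has been removed. What counts as "large" depends on the context and is not settled by the code.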

3. Analyze the model for fairness

a. Have you formulated an appropriate definition of fairness and tested the ML model against it?
Where an ML model will be used to determine significant outcomes for people, and where legal constraints or socially agreed norms are clear, it may be appropriate to formulate a quantifiable measure of fairness. This can be used to test the model itself and detect whether its outputs might have an unfair disparate impact by gender (Dwork et al., 2012; Hardt et al., 2016; Corbett-Davies et al., 2017). NB: Even a system that is ‘fair’ according to such measures could still be used in unjust ways if it is deployed on an already marginalized population.
b. Pay attention to intersectionality when assessing potentially discriminatory ML models.
Some measures of fairness involve testing how the model performs on a limited set of protected characteristics (e.g. gender, race) in isolation; however, this could miss intersectional forms of discrimination, where people are at the intersection of two or more forms of discrimination (e.g. “Black women,” “elderly Asian men”). Alternative measures aim to ensure that the algorithm achieves good performance not just in aggregate but also for each of the subpopulations, which can be defined at a more granular level (Kim et al., 2019; Tannenbaum et al., 2019).
c. Use the fairness definition to constrain or re-train your ML system.
Having established a fairness definition, use it to audit an existing system or to constrain the development of a new one, so that it conforms to the definition. Various approaches can be used: pre-processing the data to mitigate biases in the model, constraining the learning algorithm, or modifying the model after training. Which of these is most practical will depend on context (Bellamy et al., 2018).
d. Have you considered how the model, once deployed, might interact with the broader socio-technological systems in gendered ways?
For instance, a model designed to predict the optimal bid for advertising space in a personalized auction might appear to be gender-neutral in testing. When that system is deployed against competitors who are targeting by gender, however, the supposedly gender-neutral system may end up placing advertisements in a gender-biased way (Lambrecht & Tucker, 2019). If the outputs are used to guide human decision-making, potential biases of the decision-makers might cancel out or over-compensate for any de-biasing measures applied to the ML model.
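
The difference between testing protected characteristics in isolation and testing their intersections (points a and b above) can be illustrated with a small Python sketch. It assumes two common quantifiable measures: the selection rate per group (compared across groups for demographic parity) and the true positive rate per group (compared across groups for equality of opportunity, in the sense of Hardt et al., 2016). The toy predictions, the attribute names and the helpers `group_rates` and `max_gap` are hypothetical; which fairness definition is appropriate remains a contextual decision.

```python
from collections import defaultdict


def group_rates(rows, group_keys):
    """Selection rate and true positive rate for each subgroup.

    rows: dicts with binary 'y_true' and 'y_pred' plus group attributes.
    group_keys: tuple of attribute names; passing more than one name
    yields intersectional subgroups.
    """
    stats = defaultdict(lambda: {"n": 0, "selected": 0, "pos": 0, "true_pos": 0})
    for row in rows:
        s = stats[tuple(row[k] for k in group_keys)]
        s["n"] += 1
        s["selected"] += row["y_pred"]
        s["pos"] += row["y_true"]
        s["true_pos"] += row["y_true"] * row["y_pred"]
    return {
        key: {
            "selection_rate": s["selected"] / s["n"],
            # The true positive rate is undefined without positive examples.
            "tpr": s["true_pos"] / s["pos"] if s["pos"] else None,
        }
        for key, s in stats.items()
    }


def max_gap(rates, metric):
    """Largest difference in `metric` between any two subgroups."""
    values = [r[metric] for r in rates.values() if r[metric] is not None]
    return max(values) - min(values)


# Hypothetical toy predictions: two people per intersectional subgroup.
rows = []
for gender, race, preds in [
    ("woman", "Black", [0, 0]),
    ("woman", "white", [1, 1]),
    ("man", "Black", [1, 1]),
    ("man", "white", [0, 0]),
]:
    for pred in preds:
        rows.append({"gender": gender, "race": race, "y_true": 1, "y_pred": pred})

gap_gender = max_gap(group_rates(rows, ("gender",)), "selection_rate")
gap_race = max_gap(group_rates(rows, ("race",)), "selection_rate")
gap_both = max_gap(group_rates(rows, ("gender", "race")), "selection_rate")
print(gap_gender, gap_race, gap_both)
```

In this constructed example the model looks balanced by gender alone and by race alone, yet two intersectional subgroups are never selected. Real data will rarely be this clean, and small subgroups give noisy estimates, so such gaps are best read as prompts for closer inspection.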

4. Sub-categories of machine learning
While the considerations above are widely applicable, de-biasing approaches will be different depending on the subtype of machine learning in question. The following is a non-exhaustive list of subcategories in which different methods may be needed:

a. Natural language processing
- Researchers have proposed ways to analyze and identify gender biases in computational models of language, in particular in word embeddings. Geometry-based techniques allow the comparison of the embedding of masculine or feminine words in relation to other words in a language. These approaches also enable such embeddings to be de-biased, e.g. such that “babysitter” would be equally close to “grandfather” and to “grandmother” (Bolukbasi et al., 2016). Other researchers adapted the Implicit Association Test to word embeddings to demonstrate that they contain many human-like biases (Caliskan et al., 2017). Still others have developed a gender-neutral variant of GloVe (Zhao et al., 2018). For images, researchers first showed that training algorithms can amplify biases present in the training data and then developed a technique to reduce this amplification (Zhao et al., 2017).
- Similar techniques can also be used to study historical stereotypes and biases in society at large. Researchers used geometry-based metrics for word embeddings trained on text data spanning 100 years to identify changes in historical gender and ethnic biases (Garg et al., 2018). Similar methods may be used to compare biases in texts today.
b. Ranking and recommender systems
- Fairness in ranking deals with situations in which a set of people or items needs to be ranked (e.g. candidates for a job interview, or book recommendations) in a fair way. Because measures of quality or relevance might be biased against some groups (e.g. female candidates or authors), the top-ranked candidates or books might disproportionately favor others (e.g. men). Computational approaches to fair ranking aim to enforce ranked group fairness on these search results or recommendations (e.g. Zehlike et al., 2017).
c. Speech recognition
- Current speech recognition suffers from challenges similar to those of computer vision (Tatman, 2017); similar care in collecting representative data sets should be applied.
d. Facial recognition
- NIST’s 2019 report on evaluating commercial face-recognition services found that “Black females...is the demographic with the highest FMR (false match rate)” (NIST, 2019).
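
The geometry-based comparison of word embeddings described under natural language processing above can be sketched as follows. This is a simplified illustration and not the published method: it estimates a gender direction from a single word pair, whereas Bolukbasi et al. (2016) derive the direction from several pairs and add a further equalization step. The three-dimensional vectors are invented toy values, and the helpers `gender_direction` and `neutralize` are hypothetical names.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def gender_direction(masculine, feminine):
    """Unit vector pointing from the feminine word to the masculine word."""
    diff = [m - f for m, f in zip(masculine, feminine)]
    length = math.sqrt(dot(diff, diff))
    return [d / length for d in diff]


def neutralize(vec, direction):
    """Remove the component of `vec` that lies along the unit `direction`."""
    projection = dot(vec, direction)
    return [a - projection * d for a, d in zip(vec, direction)]


# Invented toy embeddings (real embeddings have hundreds of dimensions).
emb = {
    "grandfather": [1.0, 1.0, 0.0],
    "grandmother": [-1.0, 1.0, 0.0],
    "babysitter": [-0.5, 0.5, 1.0],
}

direction = gender_direction(emb["grandfather"], emb["grandmother"])

# Before: the toy "babysitter" vector sits closer to "grandmother".
before = (cosine(emb["babysitter"], emb["grandfather"]),
          cosine(emb["babysitter"], emb["grandmother"]))

# After removing the gender component it is equally close to both words.
# This holds here because the toy pair is symmetric about the direction.
debiased = neutralize(emb["babysitter"], direction)
after = (cosine(debiased, emb["grandfather"]),
         cosine(debiased, emb["grandmother"]))

print(before, after)
```

The projection of a word onto the estimated direction can also serve as a simple bias score, which is the kind of geometric metric that can be tracked across embeddings trained on different corpora or time periods.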

Works Cited

Bellamy, R. K., Dey, K., Hind, M., Hoffman, S. C., Houde, S., Kannan, K., ... & Nagar, S. (2018). AI fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. arXiv preprint arXiv:1810.01943.

Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems (pp. 4349-4357).

Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183-186.

Crenshaw, K. (1991). Mapping the margins: Intersectionality, identity politics, and violence against women of color. Stanford Law Review, 43, 1241.

Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., & Huq, A. (2017). Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 797-806). ACM.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference (pp. 214-226). ACM.

Eubanks, V. (2018). Automating inequality: How high-tech tools profile, police, and punish the poor. New York: St. Martin's Press.

Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635-E3644.

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2018). Datasheets for datasets. arXiv preprint arXiv:1803.09010.

Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (pp. 3315-3323).

Grother, P. J., Ngan, M. L., & Hanaoka, K. K. (2019). Ongoing Face Recognition Vendor Test (FRVT) part 1: verification (No. NIST Interagency/Internal Report (NISTIR)-8238). https://www.nist.gov/sites/default/files/documents/2019/07/03/frvt_report_2019_07_03.pdf

Gurses, S., Overdorf, R., & Balsa, E. (2018). POTs: The revolution will not be optimized. 11th Hot Topics in Privacy Enhancing Technologies (HotPETs).

Holland, S., Hosny, A., Newman, S., Joseph, J. & Chmielinski, K. (2018). The dataset nutrition label: a framework to drive higher data quality standards. Preprint at https://arxiv.org/abs/1805.03677.

Jobin, A., Ienca, M., & Vayena, E. (2019). Artificial Intelligence: the global landscape of ethics guidelines. arXiv preprint arXiv:1906.11668.

Kim, M. P., Ghorbani, A., & Zou, J. (2019). Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (pp. 247-254).

Lambrecht, A., & Tucker, C. (2019). Algorithmic bias? An empirical study of apparent gender-based discrimination in the display of STEM career ads. Management Science. https://doi.org/10.1287/mnsc.2018.3093

Leavy, S. (2018). Gender bias in artificial intelligence: the need for diversity and gender theory in machine learning. In Proceedings of the 1st International Workshop on Gender Equality in Software Engineering. ACM.

National Institute of Standards and Technology (NIST). (2019). Face Recognition Vendor Test (FRVT) Part 2: Identification. https://www.nist.gov/publications/face-recognition-vendor-test-frvt-part-2-identification

Nielsen, M. W., Andersen, J. P., Schiebinger, L., & Schneider, J. W. (2017). One and a half million medical papers reveal a link between author gender and attention to gender and sex analysis. Nature Human Behaviour, 1(11), 791.

Nielsen, M. W., Bloch, C. W., & Schiebinger, L. (2018). Making gender diversity work for scientific discovery and innovation. Nature Human Behaviour, 2, 726-734.

Noble, S. U. (2018). Algorithms of Oppression: How Search Engines Reinforce Racism. New York: NYU Press.

Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447-453.

Popejoy, A. B., & Fullerton, S. M. (2016). Genomics is failing on diversity. Nature, 538(7624), 161-164.

Prates, M. O., Avelar, P. H., & Lamb, L. (2018). Assessing gender bias in machine translation—a case study with Google Translate. arXiv preprint arXiv:1809.02208.

Schiebinger, L., Klinge, I., Sánchez de Madariaga, I., Paik, H. Y., Schraudner, M., and Stefanick, M. (Eds.) (2011-2019). Gendered innovations in science, health & medicine, engineering and environment, engineering, machine translation.

Shankar, S., Halpern, Y., Breck, E., Atwood, J., Wilson, J., & Sculley, D. (2017). No classification without representation: assessing geodiversity issues in open data sets for the developing world. arXiv preprint arXiv:1711.08536.

Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.

Sweeney, L. (2013). Discrimination in online ad delivery. Queue, 11(3), 10.

Tannenbaum, C., Ellis, R., Eyssel, F., Zou, J., & Schiebinger, L. (2019). Sex and gender analysis improves science and engineering. Nature, 575(7781), 137-146.

Tatman, R. (2017). Gender and dialect bias in YouTube's automatic captions. In Proceedings of the First Workshop on Ethics in Natural Language Processing (pp. 53-59). ACL.

Wagner, C., Garcia, D., Jadidi, M., & Strohmaier, M. (2015). It's a man's Wikipedia? Assessing gender inequality in an online encyclopedia. In Ninth International AAAI Conference on Web and Social Media (pp. 454-463).

Zehlike, M., Bonchi, F., Castillo, C., Hajian, S., Megahed, M., & Baeza-Yates, R. (2017). Fa*ir: A fair top-k ranking algorithm. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (pp. 1569-1578). ACM.

Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2017). Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457.

Zhao, J., Zhou, Y., Li, Z., Wang, W., & Chang, K.-W. (2018). Learning gender-neutral word embeddings. arXiv preprint arXiv:1809.01496.

Zou, J., & Schiebinger, L. (2018). Design AI that’s fair. Nature, 559(7714), 324-326.
