Discrimination based on race and ethnicity in personnel selection is a well-known and pervasive issue highlighted in numerous studies (Bertrand & Mullainathan, 2004; Kline et al., 2021; Pager et al., 2009).
Most of these studies report name-based inference of race and ethnicity by human reviewers, leading to differential outcomes in the recruitment process. Linguistic racism is a form of discrimination that occurs based on one's use of language, especially English (De Costa, 2020), and is highly associated with race and ethnicity.
As machine learning models are adopted to automate tasks like interview scoring, race- or ethnicity-encoded signals in language can lead to biased outcomes if not mitigated. Hence, understanding the extent to which language encodes ethnicity signals is important when building natural language-based machine learning models in order to avoid biased outcomes, for example by using feature scores rather than raw text to score responses (Jayaratne, Jayatilleke, & Dai, 2022).
In this work, we sought to quantify and compare the amount of ethnicity-encoded information in over 300,000 candidates' raw-text interview responses with that in language-derived feature scores, including personality, behavioral competencies, and communication skills.
First, we trained machine learning models to predict candidate ethnicity from raw-text chat interview responses. Specifically, we trained an Attention-Based Bidirectional Long Short-Term Memory (Attn-BiLSTM) (Zhou et al., 2016) model for predicting ethnicity from textual responses.
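The core of the Attn-BiLSTM architecture (Zhou et al., 2016) is an attention layer that collapses the sequence of BiLSTM hidden states into a single sentence vector via a learned weighting. The mechanism can be sketched in NumPy as follows; the dimensions, random weights, and variable names here are illustrative stand-ins, not the trained model:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical dimensions: T timesteps (tokens), d hidden units per direction.
T, d = 6, 8
rng = np.random.default_rng(0)

# H: concatenated forward/backward BiLSTM hidden states, shape (T, 2d).
# In the real model these come from the BiLSTM; here they are random.
H = rng.standard_normal((T, 2 * d))

# Attention (Zhou et al., 2016): alpha = softmax(w^T tanh(H)),
# sentence representation r = H^T alpha.
w = rng.standard_normal(2 * d)   # learned attention vector (random here)
M = np.tanh(H)                   # (T, 2d)
alpha = softmax(M @ w)           # (T,) one attention weight per timestep
r = H.T @ alpha                  # (2d,) attention-weighted sentence vector
```

The vector `r` would then feed a final classification layer over the ethnicity labels; the attention weights `alpha` sum to one and indicate how much each token contributes to the prediction.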
Second, we repeated the task for the language-derived features used in the automated scoring of the interview responses. We trained multiple models using a variety of machine learning algorithms suitable for tabular data (a linear model, tree models with bagging and boosting, and a neural network model with a single hidden layer) to predict ethnicity from the 21 derived features.
Each model was then used to predict ethnicity for the 10% of the sample held out of the training dataset. The results from the classification tasks show a clear distinction between the ability to infer ethnicity from natural language and from derived features. As hypothesized, we found that features derived according to a clearly defined rubric contain significantly less ethnicity information than raw candidate responses. That is, the models based on derived features recorded consistently weaker accuracy, precision, recall, and F1 values across all models compared to the model trained on the raw-text candidate responses.
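The tabular-model comparison described above can be sketched with scikit-learn. This is a minimal illustration, not the study's pipeline: the data are synthetic stand-ins for the 21 rubric-derived feature scores, the label is binary for simplicity (the actual task is over ethnicity groups), and the specific estimators and hyperparameters are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for 21 derived feature scores and a binary label.
X, y = make_classification(n_samples=2000, n_features=21, random_state=42)

# Hold out 10% of the sample for evaluation, mirroring the split above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10, random_state=42)

# One representative per model family named in the text.
models = {
    "linear": LogisticRegression(max_iter=1000),
    "bagging": RandomForestClassifier(n_estimators=100, random_state=42),
    "boosting": GradientBoostingClassifier(random_state=42),
    "single-hidden-layer NN": MLPClassifier(
        hidden_layer_sizes=(32,), max_iter=500, random_state=42
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred),
        "recall": recall_score(y_te, pred),
        "f1": f1_score(y_te, pred),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Comparing these held-out metrics against the same metrics for a text-based classifier is what supports the contrast drawn above: if the feature-based models score consistently lower, the derived features carry less ethnicity signal than the raw text.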
This research demonstrates the benefit of using algorithmically derived feature values in mitigating ethnicity-related biases when scoring structured interview responses. Specifically, our results show that natural language responses to interview questions carry more ethnicity information than features derived according to a clearly defined rubric for assessing interview responses. This further strengthens the case for structured interviews, which have been shown to reduce bias relative to unstructured interviews (Levashina et al., 2014) while offering much stronger criterion validity (Sackett et al., 2021).
References:
Bertrand, M., & Mullainathan, S. (2004). Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. American Economic Review, 94(4), 991–1013.
De Costa, P. I. (2020). Linguistic racism: Its negative effects and why we need to contest it. International Journal of Bilingual Education and Bilingualism, 23(7), 833–837.
Jayaratne, M., Jayatilleke, B., & Dai, Y. (2022). Identifying and mitigating gender bias in structured interview responses [Paper presentation]. 2022 Society for Industrial and Organizational Psychology Conference, Seattle, WA, United States.
Kline, P. M., Rose, E. K., & Walters, C. R. (2021). Systemic Discrimination Among Large U.S. Employers (NBER Working Papers No. 29053). National Bureau of Economic Research, Inc.
Levashina, J., Hartwell, C. J., Morgeson, F. P., & Campion, M. A. (2014). The Structured Employment Interview: Narrative and Quantitative Review of the Research Literature. Personnel Psychology, 67(1), 241–293.
Pager, D., Western, B., & Bonikowski, B. (2009). Discrimination in a Low-Wage Labor Market: A Field Experiment. American Sociological Review, 74(5), 777–799.
Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F. (2021). Revisiting meta-analytic estimates of validity in personnel selection: Addressing systematic overcorrection for restriction of range. Journal of Applied Psychology.
Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., & Xu, B. (2016). Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 207–212.