Methods and results
Data for this experiment come from 633,413 candidates (52.89% female) who applied to two large retail organizations in the UK and Australia. Candidates participated in an online chat-based structured interview on Sapia’s Chat Interview platform, in which they answered 5-7 open-ended questions related to past behavior and situational judgment.
Candidates’ textual answers were used to calculate a number of features, including their HEXACO personality traits (Ashton & Lee, 2007); behavioral competencies such as drive, resourcefulness, accountability, and language fluency; and their job-hopping motive (Lake et al., 2018). Further details of the text-to-personality and text-to-job-hopping-motive inference models can be found in Jayaratne and Jayatilleke (2020, 2021).
To quantify the gender information available in the raw and transformed formats, we trained classification models to predict gender. To predict gender from the 21 inferred features (see Table 2 for the list of features), we trained multiple models suited to tabular data using a variety of machine learning algorithms: a linear model, tree models with bagging and boosting, a support vector machine, and a neural network with a single hidden layer.
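As a rough illustration of this kind of probing setup, the Python sketch below trains only a linear model (logistic regression fitted by stochastic gradient descent) on 21 features and scores it on a held-out 10% of the data. Everything in it is an assumption made for illustration: the data are synthetic, and `make_synthetic`, its `shift` parameter, and the training hyperparameters are hypothetical and not taken from the study, which also used tree, support vector machine, and neural network models.

```python
import math
import random


def make_synthetic(n, n_features=21, shift=1.0, seed=0):
    """Hypothetical stand-in data: label 1 shifts the first three features by `shift`."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in range(3):
            row[j] += shift * label
        X.append(row)
        y.append(label)
    return X, y


def holdout_split(X, y, test_fraction=0.10, seed=0):
    """Shuffle the row indices and hold out `test_fraction` of the rows for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(X) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


def score(w, b, row):
    """Linear score; a value >= 0 corresponds to a predicted probability >= 0.5."""
    return b + sum(wj * xj for wj, xj in zip(w, row))


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, epochs=30, lr=0.05):
    """Fit logistic regression with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for row, label in zip(X, y):
            err = sigmoid(score(w, b, row)) - label  # gradient of the log loss
            b -= lr * err
            w = [wj - lr * err * xj for wj, xj in zip(w, row)]
    return w, b


def accuracy(w, b, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum((1 if score(w, b, row) >= 0 else 0) == label
               for row, label in zip(X, y))
    return hits / len(y)


X, y = make_synthetic(2000)
X_tr, y_tr, X_te, y_te = holdout_split(X, y)
w, b = train_logistic(X_tr, y_tr)
probe_accuracy = accuracy(w, b, X_te, y_te)
print(f"held-out accuracy of the linear probe: {probe_accuracy:.3f}")
```

The held-out accuracy of such a probe serves as a proxy for how much gender information the features carry: the closer it is to chance (50% for balanced classes), the less information is recoverable by that model family.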
To predict gender from textual responses, we trained an Attention-Based Bidirectional Long Short-Term Memory (Attn-BiLSTM; Zhou et al., 2016) model, a deep learning algorithm with strong performance on text classification tasks. Table 1 presents the accuracy, precision, recall, and F1 scores achieved by these models on the unseen 10% test data. With 78.47% accuracy and F1 scores of 80.45% and 76.05% for the female and male groups, respectively, the raw candidate responses carry the higher level of gender information. The models based on derived features recorded consistently weaker accuracies of around 60% across all algorithms. In terms of F1, these models scored higher for females (64-66%) than for males (52-56%), but these values are still substantially lower than those recorded by the raw text-based model. Note that we found it sufficient to report the outcomes of a single algorithm in the case of raw text, because any improvement over the current results from a different algorithm would only make the case stronger.
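For readers less familiar with the metrics reported here, the following Python sketch shows how accuracy and per-group precision, recall, and F1 are conventionally computed from true and predicted labels. The toy labels and the helper names `per_class_metrics` and `accuracy` are illustrative assumptions; they do not reproduce the study's data or its evaluation code.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 when `positive` is treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (hypothetical): five female (F) and five male (M) candidates.
y_true = ["F", "F", "F", "F", "F", "M", "M", "M", "M", "M"]
y_pred = ["F", "F", "F", "F", "M", "F", "F", "M", "M", "M"]

acc = accuracy(y_true, y_pred)
prec_f, rec_f, f1_f = per_class_metrics(y_true, y_pred, "F")
prec_m, rec_m, f1_m = per_class_metrics(y_true, y_pred, "M")
print(f"accuracy={acc:.2f}  F1(female)={f1_f:.3f}  F1(male)={f1_m:.3f}")
```

In this toy case a single accuracy value (0.70) coexists with different F1 scores for the two groups, because the classifier over-predicts one class. This is why per-group F1 is worth reporting alongside accuracy.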