Biased people are much harder to fix than algorithms

We worry intensely about the amplification of lies and prejudice by the technology that fuels social media platforms like Facebook, yet do we hold the mirror up to ourselves and check our own tendency to hire in our image?

How many times have you told a candidate they didn’t get the job because they were not the right “culture fit”?

The truth is that we humans are inscrutable in a way that algorithms are not, which means we are often not accountable for our biases.

In algorithms, bias is visible, measurable, trackable and fixable.

A compelling feature of our technology is that our AI can’t see you, hear you, or judge you on irrelevant personal characteristics (like gender, age or skin colour) the way a human can. That’s one reason trusted consumer brands like Qantas, Superdry and Bunnings use it to make fair, unbiased hiring decisions.

To validate that our algorithms are bias-free, we do extensive bias testing (impossible to do for humans). We know from this testing that there is no statistically significant difference in the way the algorithm treats men, women, and people of different ethnicities.

Our bias testing happens at three levels:

1. The score calculated by the predictive model for each candidate.
2. The recommendation grouping based on score percentile.
3. The feature values used to train the predictive model.

For Gender-bias testing:

To analyse whether our test scores show any gender bias, we use the t-test and effect size. To test our recommendation groupings of YES, NO and MAYBE, we use the chi-square test, Fisher’s exact test and the 4/5ths rule. The last of these is the standard set by the EEOC for any assessment used in candidate selection.
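To make the gender tests concrete, here is a minimal sketch of each check using `scipy.stats`. All scores, counts and variable names are hypothetical, not our production pipeline:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
male_scores = rng.normal(70, 10, 500)    # hypothetical model scores
female_scores = rng.normal(70, 10, 500)

# t-test: do mean scores differ between the two groups?
t_stat, p_value = stats.ttest_ind(male_scores, female_scores)

# Recommendation counts (YES / MAYBE / NO) per group -- hypothetical data.
counts = np.array([[120, 250, 130],    # male
                   [115, 255, 130]])   # female
chi2, chi_p, dof, _ = stats.chi2_contingency(counts)

# Fisher's exact test works on a 2x2 table, e.g. YES vs. not-YES.
fisher_table = [[120, 380], [115, 385]]
odds_ratio, fisher_p = stats.fisher_exact(fisher_table)

# 4/5ths (adverse impact) rule: each group's selection rate must be at
# least 80% of the highest group's selection rate.
rate_m, rate_f = 120 / 500, 115 / 500
impact_ratio = min(rate_m, rate_f) / max(rate_m, rate_f)
passes_four_fifths = impact_ratio >= 0.8
print(f"t p={p_value:.3f}, chi2 p={chi_p:.3f}, fisher p={fisher_p:.3f}, "
      f"impact ratio={impact_ratio:.2f}, passes 4/5ths={passes_four_fifths}")
```

In practice a low p-value on any of these tests, or an impact ratio below 0.8, would be the trigger for a closer look at the model.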

For Ethnicity-bias testing:

We use the 4/5ths rule and the ANOVA test.
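With more than two groups, ANOVA compares all group means at once and the 4/5ths rule is applied against the highest-rate group. A minimal sketch, with made-up group names and selection rates:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical score samples for three ethnic groups.
groups = {name: rng.normal(70, 10, 300)
          for name in ("group_a", "group_b", "group_c")}

# One-way ANOVA: are mean scores equal across all groups?
f_stat, p_value = stats.f_oneway(*groups.values())

# 4/5ths rule generalised: every group's selection rate must be at least
# 80% of the highest group's rate. Rates here are illustrative.
selection_rates = {"group_a": 0.24, "group_b": 0.23, "group_c": 0.22}
highest = max(selection_rates.values())
impact_ratios = {g: r / highest for g, r in selection_rates.items()}
passes = all(r >= 0.8 for r in impact_ratios.values())
print(f"ANOVA p={p_value:.3f}, "
      f"min impact ratio={min(impact_ratios.values()):.2f}, passes={passes}")
```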

For Feature-level bias testing:

This ensures that none of the feature values we use to assess candidate fit is itself biased. Here we use the t-test, effect size and the ANOVA test.
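The same tests apply one feature at a time rather than to the final score. A sketch for a single hypothetical feature, checked across gender (t-test) and across ethnic groups (ANOVA):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical input feature (e.g. a text-derived value), sampled per group.
feature_by_gender = {"female": rng.normal(0.5, 0.1, 200),
                     "male": rng.normal(0.5, 0.1, 200)}
feature_by_ethnicity = [rng.normal(0.5, 0.1, 150) for _ in range(3)]

# t-test across the two gender groups for this one feature.
t_stat, t_p = stats.ttest_ind(*feature_by_gender.values())

# One-way ANOVA across the three ethnic groups for the same feature.
f_stat, anova_p = stats.f_oneway(*feature_by_ethnicity)
print(f"feature t-test p={t_p:.3f}, feature ANOVA p={anova_p:.3f}")
```

A feature that fails these checks would be a candidate for removal before the model is ever trained on it.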

Diving into just one of these: effect size is an easy-to-understand statistical measure of the difference in average scores between males and females. If the effect size is positive in our test set, females have higher scores than males, and vice versa.

The magnitude of the effect size also matters – the larger the magnitude, the more significant the difference. We generally consider values smaller than +/- 0.3 a negligible difference, values from +/- 0.3 to +/- 0.5 a moderate difference, and values larger than +/- 0.8 a large difference.
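One common way to compute a signed effect size like this is Cohen’s d (a standardised mean difference); a sketch with illustrative data, using the bands quoted above (the 0.5–0.8 range is not named in the text, so we label it “moderate-to-large” as an assumption):

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Signed standardised mean difference: positive if group_a scores higher."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1)
                  + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def magnitude(d):
    # Bands as quoted in the text; 0.5-0.8 labelled here by assumption.
    m = abs(d)
    if m < 0.3:
        return "negligible"
    if m < 0.5:
        return "moderate"
    if m < 0.8:
        return "moderate-to-large"
    return "large"

female = [72, 75, 70, 68, 74, 71]   # hypothetical scores
male = [70, 73, 69, 67, 72, 70]
d = cohens_d(female, male)          # positive => females score higher
print(f"effect size d={d:.2f} ({magnitude(d)})")
```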

We periodically test our models for score and recommendation bias and take action if the bias highlighted is non-negligible. For example, if the effect size moves beyond the +/- 0.3 range, we stop the model until we can find the source of the bias, then retrain and retest to make sure the new model is unbiased.
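The periodic check described above amounts to a simple decision rule. A minimal sketch – the threshold matches the text, but the function and messages are our own illustration, not the vendor’s actual system:

```python
EFFECT_SIZE_LIMIT = 0.3  # the +/- 0.3 negligibility threshold from the text

def review_model(effect_size, model_active=True):
    """Return (model_active, action) after a scheduled bias review."""
    if abs(effect_size) > EFFECT_SIZE_LIMIT:
        # Non-negligible bias: stop the model, investigate, retrain, retest.
        return False, "stop: investigate source of bias, retrain and retest"
    return model_active, "no action: bias negligible"

print(review_model(0.12))   # within tolerance, model keeps running
print(review_model(-0.45))  # flagged, model stopped pending retraining
```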

For more insight into how our technology removes bias, and how we track and measure it, read diversity hiring.
