Written by Nathan Hewitt

Question-aware outlier answer detection for fairer AI scoring of interviews

Artificial Intelligence-based interview scoring learns from past interview answers, which makes it hard to determine whether a candidate is legitimately answering the question when their response includes context or an example rarely seen in the training data. Moreover, AI interviewers may be susceptible to adversarial inputs, where an irrelevant answer receives a high score. Both scenarios raise fairness concerns and can erode trust in AI interviewers (Madaio et al., 2020).

This is why identifying outliers that differ significantly from the majority of answers, and flagging them for manual review, are crucial steps toward responsible and fair use of AI interviewers. While simple rule-based methods (Reiz and Pongor, 2011) can filter out some irrelevant answers based on answer length and regular expressions, they do not take into account the context and content of the question and answer. A candidate may describe a unique yet relevant situation in response to an interview question, which you would not want to disregard.

In this study, we introduce an unsupervised, question-aware, multi-context outlier detection model that can detect anomalous answers contextually and semantically. The unsupervised approach is more practical than a supervised model, which would require a large labeled dataset of outlier answers. It bootstraps an outlier detector that can then be enhanced through human feedback.
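As a rough sketch of the question-aware intuition, an answer can be scored by how dissimilar its joint question-plus-answer representation is from typical answers to the same question. The example below is illustrative only: the bag-of-words features and the nearest-typical-answer scoring are our assumptions, not the model described above.

```python
import math
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector as a Counter."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def outlier_score(question, answer, typical_answers):
    """Question-aware score: 1 minus the best similarity between the
    candidate's (question + answer) text and typical (question + answer)
    pairs. Higher means more anomalous."""
    cand = bow(question + " " + answer)
    sims = [cosine(cand, bow(question + " " + t)) for t in typical_answers]
    return 1.0 - max(sims)

typical = [
    "I worked with my team to resolve the conflict by listening to both sides",
    "I stayed calm and helped my colleague finish the project on time",
]
question = "Describe a time you handled conflict at work"

relevant = outlier_score(
    question, "I listened to my colleague and we resolved the conflict together", typical)
irrelevant = outlier_score(
    question, "The movie had stunning visuals and a terrible plot", typical)
print(relevant < irrelevant)  # the off-topic answer scores as more anomalous
```

Because the question text is part of both the candidate vector and the typical vectors, the same answer can score differently under different questions, which is the point of question-awareness.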

We tested how well the outlier model distinguishes 177,691 interview answers from actual hired candidates from outlier texts (e.g., movie reviews, news articles, nonsensical text, and sentences generated from random starting words using BERT, a Transformer-based model (Vaswani et al., 2017)).

Our model outperformed the baseline One-class SVM outlier detector (Li et al., 2003) in detecting outliers among actual interview answers. Its advantage over the baseline unsupervised model can be explained by both question-aware learning and multi-context learning, which yield better contextual representations for separating outlier answers from typical interview answers.
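For illustration, the One-class SVM baseline can be sketched with scikit-learn: fit only on "typical" answers, then flag anything far from that distribution. The TF-IDF features, toy data, and parameter values here are assumptions for the sketch, not the setup used in the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Fit only on answers assumed to be typical; no outlier labels needed.
typical_answers = [
    "I worked with my team to resolve the disagreement",
    "I listened to the customer and found a fair solution",
    "I stayed calm and helped my colleague meet the deadline",
    "I asked my manager for feedback and improved my approach",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(typical_answers)

# nu bounds the fraction of training points treated as outliers.
detector = OneClassSVM(kernel="rbf", gamma="auto", nu=0.1).fit(X)

tests = [
    "I spoke with my colleague and we resolved the disagreement",  # in-domain
    "The stock market fell sharply after the earnings report",     # off-topic
]
labels = detector.predict(vectorizer.transform(tests))  # +1 inlier, -1 outlier
print(labels)
```

Note that this baseline sees only the answer text, with no notion of the question, which is one reason a question-aware model can do better.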

We also conducted a human evaluation on 10,689 interview answers from candidates who were not hired and might have provided outlier answers. Our model flagged 0.16% of the answers as outliers, with only 5.9% of those flags being false positives. All of the false positives describe contexts related to family and personal life yet are relevant to the question. It is reasonable that our model labels these answers as outliers, since they are contextually and semantically different from most interview answers.

While a data-driven AI interviewer can help counter flaws in human interviewers, answers that differ significantly from the training data can lead to spurious predictive outcomes. In this study, we show how a question-aware, multi-context outlier detection model can be applied to identify outlier answers. Flagging such answers for human review enhances fairness and provides a supervised signal to improve the outlier detection model over time.


Dai, Y., Qi, J., & Zhang, R. (2020). Joint recognition of names and publications in academic homepages. In Proceedings of the 13th International Conference on Web Search and Data Mining (pp. 133-141).

Li, K. L., Huang, H. K., Tian, S. F., & Xu, W. (2003). Improving one-class SVM for anomaly detection. In Proceedings of the 2003 international conference on machine learning and cybernetics (IEEE Cat. No. 03EX693) (Vol. 5, pp. 3077-3081). IEEE.

Madaio, M. A., Stark, L., Wortman Vaughan, J., & Wallach, H. (2020). Co-designing checklists to understand organizational challenges and opportunities around fairness in AI. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (pp. 1-14).

Reiz, B., & Pongor, S. (2011). Psychologically inspired, rule-based outlier detection in noisy data. In SYNASC (pp. 131-136).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.


Improvements to customer data security and sovereignty

In June 2022, we announced that, thanks to our partnership with AWS, we have introduced regional data hosting. This means customers and their candidates experience faster performance on the Sapia platform, and companies using the platform can be confident that candidate data is handled in line with data sovereignty legislation, such as the EU’s General Data Protection Regulation (GDPR).

Here is the full list of improvements to data security and sovereignty for Sapia customers.

World-leading protections
Sapia’s platform is built on AWS, and is protected by anti-virus, anti-malware, intrusion detection, intrusion protection, and anti-DDoS protocols. We comply with most major cybersecurity standards, including ISO 27001, SOC 2 Type 1 (Type 2 in progress), and GDPR.

We use AWS’ serverless solution, which can automatically support billions of requests per day. Our sophisticated tech stack includes React.js, GraphQL, MongoDB, Node.js and Terraform.

Regional data hosting
Sapia offers regional data hosting via AWS. All data is processed within highly secure, fault-tolerant data centres located in the same geography as where the data is stored. All data is stored in and served from AWS data centres using industry-standard encryption, both at rest and in transit.

Availability and reliability
Sapia uses a purpose-built, distributed, fault-tolerant, self-healing storage system that replicates data six ways across three AWS Availability Zones (AZs), making it highly durable. Backups are continuous and automatic, and allow for point-in-time restore (PITR).


What can HR learn about risk management from banks?

The Royal Commission has brought about a lot of scrutiny on the banks, and for good reason. But we have to give them credit where it’s due.

Compared to HR teams across the country, banks know a thing or two when it comes to managing risk.

Which is funny, as I’d argue that hiring a staff member is a much riskier proposition for a business than a bank having one of its customers default on a loan.

Imagine if your bank lent you money with the same process that your average recruiter used to hire for a role.

They would ask you to load all of your personal financial information into an exhaustive application form. Your salary, your weekly spend, your financial commitments. All of it.

The same form would include a lot of probing questions, such as:

  • Will you pay this money back on time?
  • When have you borrowed in the past and paid back on time?
  • Describe a time that you struggled to repay a loan and what you did about it.

Then, assuming your form piqued their interest, they would bring you in for a one-on-one meeting with the bank manager. That manager would grill you with a stern look, asking the same questions. This time, though, they would be closely watching your eye movements to see if you were lying when you answered.

In each part of the process, you get a score, and then if that number is above a certain threshold, you get the loan.

It’s almost laughable, right?

Banks wouldn’t have any customers if they used that approach.

Only people who desperately need money would put themselves through that process. And they’re likely not the best loan candidates.

Banks work hard to attain incredibly high accuracy levels in assessing loan risk.

Meanwhile in HR, if you use turnover as a measure of hiring accuracy, it’s as low as 30–50 per cent in some sectors. If you combine both turnover and performance data (how many people who get hired really raise a company’s performance), it might be even lower than that.

Banks wouldn’t exist if their risk accuracy was anywhere close to those numbers.

Well, that’s how most recruitment currently works — just usually involving more people.

There are more parallels here than you think.

Just like a bank manager, every recruiter wants to get it right and make the best decisions for the sake of their employer. As everyone in HR knows, hiring is one of the greatest risks a business can take on.

But they are making the highest risk decision for an organisation based on a set of hypotheses, assumptions and lots of imperfect data.

So, let’s flip the thought experiment.

What if a bank’s risk management department was running recruitment? What would the risk assessment look like?

Well, the process wouldn’t involve scanning CVs, a 10-minute phone call, a face-to-face interview and then a decision.

That would be way too expensive, given that far more people apply for jobs than for loans each year. Not to mention the process itself is too subjective.

I suspect they would want objective proof points on what traits make a candidate successful in a role, data that matches the candidate against those proof points and finally, further cross-validation with other external sources.

They wouldn’t really care if you were white, Asian, gay, or female. How could you possibly generalise about someone’s gender, sexuality or ethnicity and use it as a lead indicator of hiring risk? (Yet, in HR, this is still how we do it.)

Finally, they’d apply a layer of technology to the process, making it a positive, mobile-first experience for candidates. Much like with a loan, you’ll lose your best customers if the funnel is long and exhausting.

I’m not saying that banks are a beacon of business. The Royal Commission definitely showed otherwise. But for the most part, they have gotten with the times and upgraded their processes to better manage their risk. It’s time HR did the same.

Suggested Reading:

The CHRO Should Manage Bias Like the CFO Manages the Financials

What Job is HR Being Hired to do?

You can try out Sapia’s Chat Interview right now, or leave us your details to book a demo


Why text conversations work better for graduate assessment

Text is central to our everyday lives.

Texting, emailing, writing: all of this is done in natural language. The revolution in natural language processing is changing the way organisations understand and make use of text. Using text data as assessment data is compelling when you consider the facts about texting in our everyday lives.

95% of graduates in advanced economies own a mobile phone. Research from Google has shown that Generation Z prefers texting to all other forms of communication, including messaging apps and meeting in person. This preference is reflected in open rates: 90% for text messages, compared with 18% open rates and 8% response rates for email.

Somewhere in the world, a candidate completes our text-based FirstInterview assessment every 2 minutes. Through this data, we see the trend towards mobile-first assessment experiences.

Analysing the behaviour of 41,314 candidates from March 2019 to March 2020, we found that more than one third completed their assessment on mobile, with mobile applicants completing it at least 40% faster than desktop applicants.


The Power of Language to Assess Values and Traits

In graduate recruitment, and with growing unemployment as a result of COVID-19, humanising your recruitment means using assessments that feel human, that are empathetic and respectful of candidates’ time and effort, and that mirror how we live and work every day.

  • Having a text conversation on a mobile phone.
  • Asking typical interview questions to which graduates can relate, e.g., ‘Which of our values do you connect with, and why?’
  • Asking 5 or 6 questions that respect every applicant as unique and empower them to share who they are, in their own language, through their personal stories.
  • Receiving, for your eyes only, personalised and constructive feedback within minutes that deepens your self-awareness.

These aspects of the experience enhance trust and engagement, leading to a 99% positive sentiment rating.

See for yourself why: Try it Here >

Answering 5 open-ended questions generates at least 75 different data points about a person. Even 200 words are enough to reveal your true character and personality.
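As an illustration of what “data points” from free text can look like, simple linguistic signals can be computed directly from an answer. These particular features are hypothetical examples for the sketch, not Sapia’s actual measures.

```python
import re

def language_data_points(text):
    """Extract a few simple, illustrative language features from free text."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    first_person = {"i", "me", "my", "we", "our"}
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_word_length": sum(map(len, words)) / len(words),
        # vocabulary richness: unique words over total words
        "type_token_ratio": len(set(words)) / len(words),
        # how often the writer refers to themselves or their team
        "first_person_rate": sum(w in first_person for w in words) / len(words),
    }

answer = "I led my team through a difficult project. We delivered on time and I learned a lot."
print(language_data_points(answer))
```

Each of these is one data point; richer NLP models derive many more per question, which is how a handful of open-ended answers can yield dozens of signals.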

Natural Language Processing (NLP)

It is the combination of NLP (a branch of AI), our unique machine-learning models, and our proprietary dataset (containing 25 million words) that forms the foundation of our text-based assessment.

Curious to know the science behind the technology? Get in touch with us here
