In this section, we introduce the HEXACO personality model that underlies our study (Section II-A) and the related work on language and personality (Section II-B). We also provide an overview of the methods we use to infer personality from the textual content of interview responses: the word and document representation approaches found in natural language processing (Section II-C), and the BERT model architecture and self-attention mechanism (Section II-D) that form the basis of the InterviewBERT model. A lengthy discussion of the technical details of these topics is beyond the scope of this paper; we refer the reader to the work referenced under each topic.
A. HEXACO Model
HEXACO is a six-dimensional model of personality consisting of Honesty-Humility (H), Emotionality (E), eXtraversion (X), Agreeableness (A), Conscientiousness (C) and Openness (O) as dimensions.
Similar to the Big Five model of personality, the HEXACO model has its origins in lexical studies and the subsequent factor analysis used to identify a minimal set of independent dimensions, or personality traits, and their underlying facets. The use of lexical studies is grounded in the lexical hypothesis, which claims that descriptors of personality characteristics are encoded in language, a point we revisit in the next section. While the dimensions of HEXACO and the Big Five model share similarities and differ in subtle ways, a key difference is the addition of the Honesty-Humility (H) dimension, or H-factor. The H-factor is especially important in the employment assessment context because it represents characteristics desired in a workplace environment, such as modesty, fairness and honesty. Previous studies have shown that the H-factor can help explain and predict workplace deviance, delinquency, integrity, counter-productive work behaviour, organisational citizenship and job performance.
B. Language and Personality
Language analysis is a first-principles approach to understanding psychological constructs, as seen both in psycholinguistics and in the application of the lexical hypothesis to discover personality dimensions. The field of psycholinguistics is dedicated to the study of the relationship between language and various psychological aspects related to language acquisition, understanding and human thought. Prior work in this field details, with extensive research, how the way we speak reveals what we think. More importantly, personality models such as HEXACO and the Big Five are grounded in the lexical hypothesis, which states that personality characteristics that are salient in people’s daily transactions and relate to important social outcomes are encoded in language. Advances in machine learning and natural language processing (NLP) have catalysed a growing body of evidence showing the relationship between one’s language use and personality. This relationship has been demonstrated in informal contexts such as social media as well as in formal contexts such as self-narratives and job interviews. The language-personality relationship has been used to develop predictive machine learning models that infer personality traits from blogs, essays, microblogs (Twitter, Sina Weibo) and social media posts. The success of such attempts has led researchers to propose computer-generated personality predictions to “complement – and in some instances replace – traditional self-report measures, which suffer from well-known response biases and are difficult to scale”.
Language modelling within the psychological sciences typically involves two types of approaches: the closed-vocabulary approach and the open-vocabulary approach. In closed-vocabulary approaches, words are assigned to categories of psychological, social or educational relevance to create dictionaries that are considered to represent those categories. For example, words such as happiness and joy can be part of a dictionary for positive emotions. Linguistic Inquiry and Word Count (LIWC) is one such lexicon. Using LIWC, researchers have found correlations between language patterns and personality. Open-vocabulary approaches, on the other hand, are more data-driven. In an open-vocabulary NLP system, algorithms process a large set of linguistic data and identify semantically related words through numerical word representation methods (detailed in Section II-C). These representations can be used to predict outcomes with supervised machine learning algorithms, or to gain further insights through exploration with unsupervised algorithms such as clustering. Unlike closed-vocabulary methods, open-vocabulary methods build upon the idea that words can be represented with numerical values based on how they co-occur, yielding powerful language models that represent words according to the contexts in which they appear rather than relying on assumptions about word-category relations. This eliminates the need for human-created categories and dictionaries, which limit the vocabulary known to learning algorithms. Open-vocabulary approaches are the current de facto standard for modelling language data, and they usually require a large amount of training data to learn the relationship between personality and language representation. Such predictive models have been demonstrated with success on textual data from social media.
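To make the distinction between the two approaches concrete, the following is a minimal Python sketch and not the method used in this paper. The closed-vocabulary part scores a text against a small hand-made lexicon; the open-vocabulary part represents each word by the words it co-occurs with and compares words by cosine similarity. All names here (`TOY_LEXICON`, `closed_vocab_features`, `cooccurrence_vectors`, `cosine`) are hypothetical, the lexicon is an illustrative toy rather than the LIWC dictionary, and raw co-occurrence counts stand in for the learned representations discussed in Section II-C.

```python
import math
import re
from collections import Counter

# Hypothetical toy lexicon for illustration only; this is NOT the LIWC dictionary.
TOY_LEXICON = {
    "positive_emotion": {"happy", "happiness", "joy", "glad"},
    "negative_emotion": {"sad", "angry", "worried"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def closed_vocab_features(text, lexicon=TOY_LEXICON):
    """Closed vocabulary: share of tokens that fall in each predefined category."""
    tokens = tokenize(text)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        category: sum(token in words for token in tokens) / total
        for category, words in lexicon.items()
    }


def cooccurrence_vectors(docs, window=2):
    """Open vocabulary: describe each word by counts of its neighbouring words."""
    vectors = {}
    for doc in docs:
        tokens = tokenize(doc)
        for i, word in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    print(closed_vocab_features("I feel joy and happiness"))

    docs = [
        "I enjoy working with my team",
        "I enjoy collaborating with my team",
    ]
    vectors = cooccurrence_vectors(docs)
    # "working" and "collaborating" appear in identical contexts, so they
    # receive similar vectors even though neither is listed in the lexicon.
    print(cosine(vectors["working"], vectors["collaborating"]))
    print(cosine(vectors["working"], vectors["team"]))
```

In this toy setting, the closed-vocabulary scorer only sees words that a human placed in the lexicon, whereas the co-occurrence vectors relate "working" and "collaborating" purely from their shared contexts, which is the property open-vocabulary methods exploit at scale.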