Text embedding models from Natural Language Processing can map text data (e.g. words, sentences, documents) to supposedly meaningful numerical representations (a.k.a. text embeddings). While such models are increasingly applied in social science research, one important issue is often not addressed: the extent to which these embeddings are valid representations of constructs relevant for social science research. We therefore propose using the classic construct validity framework to evaluate the validity of text embeddings. We show how this framework can be adapted to the opaque and high-dimensional nature of text embeddings, using survey questions as the application. We include several popular text embedding methods (e.g. fastText, GloVe, BERT, Sentence-BERT, Universal Sentence Encoder) in our construct validity analyses. We find evidence of convergent and discriminant validity in some cases. We also show that embeddings can be used to predict respondents' answers to completely new survey questions. Furthermore, BERT-based embedding techniques and the Universal Sentence Encoder provide more valid representations of survey questions than the other methods do. Our results thus highlight the need to examine the construct validity of text embeddings before deploying them in social science research.
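As a rough illustration of what convergent and discriminant validity could mean for embeddings, the sketch below compares the mean cosine similarity of question pairs that target the same construct with that of pairs that target different constructs. It is not the procedure used in the paper. The function names, the construct labels, and the toy three-dimensional vectors are all hypothetical, and in practice the vectors would come from one of the embedding models listed above.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def validity_summary(embeddings, constructs):
    """Return (mean within-construct similarity, mean between-construct similarity).

    Higher within-construct similarity is loosely analogous to convergent
    validity; lower between-construct similarity to discriminant validity.
    """
    within, between = [], []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine_similarity(embeddings[i], embeddings[j])
            if constructs[i] == constructs[j]:
                within.append(sim)
            else:
                between.append(sim)
    return float(np.mean(within)), float(np.mean(between))


# Toy, hand-made vectors standing in for survey-question embeddings (assumption).
toy_embeddings = [
    [1.0, 0.1, 0.0],  # hypothetical question about construct "trust"
    [0.9, 0.2, 0.0],  # hypothetical question about construct "trust"
    [0.0, 0.1, 1.0],  # hypothetical question about construct "wellbeing"
    [0.1, 0.0, 0.9],  # hypothetical question about construct "wellbeing"
]
toy_constructs = ["trust", "trust", "wellbeing", "wellbeing"]

within_mean, between_mean = validity_summary(toy_embeddings, toy_constructs)
print(f"within-construct mean similarity:  {within_mean:.3f}")
print(f"between-construct mean similarity: {between_mean:.3f}")
```

A within-construct mean that clearly exceeds the between-construct mean would be consistent with, though far from sufficient for, the kind of validity evidence the abstract describes.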