Psychological assessments typically rely on structured rating scales, which cannot incorporate the rich nuance of a respondent's natural language. This study leverages recent LLM advances to harness qualitative data within a novel conceptual framework, combining LLM-scored text with traditional rating-scale items to create an augmented test. We demonstrate this approach using depression as a case study, developing and assessing the framework on a real-world sample of upper secondary students (n=693) and a corresponding synthetic dataset (n=3,000). On held-out test sets, augmented tests achieved statistically significant improvements in measurement precision and accuracy. The information gain from the LLM items was equivalent to adding between 6.3 (real data) and 16.0 (synthetic data) items to the original 19-item test. Our approach marks a conceptual shift in automated scoring that bypasses its typical bottlenecks: instead of relying on pre-labelled data or complex expert-created rubrics, we empirically select the most informative LLM scoring instructions based on calculations of item information. This framework provides a scalable approach for leveraging the growing stream of transcribed text to enhance traditional psychometric measures, and we discuss its potential utility in clinical healthcare and beyond.
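To make the selection mechanism concrete: in item response theory, each candidate item contributes Fisher information at a given latent trait level, and test information is the sum over items. The sketch below is a minimal illustration of ranking candidate LLM scoring instructions by their integrated item information, assuming for simplicity a two-parameter logistic (2PL) model with I(θ) = a²·P(θ)·(1 − P(θ)); the function name, candidate labels, and parameter values are hypothetical and do not reproduce the paper's implementation, which the abstract does not specify.

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item at trait level theta:
    I(theta) = a^2 * P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

# Grid of latent trait values over which to integrate information.
theta = np.linspace(-3.0, 3.0, 61)

# Hypothetical (discrimination, difficulty) estimates for candidate
# LLM scoring instructions, e.g. fitted on a calibration split.
candidates = {
    "instruction_A": (1.8, -0.5),
    "instruction_B": (0.6, 1.2),
    "instruction_C": (1.4, 0.3),
}

# Rank candidates by total information across the theta grid and
# retain the most informative instructions for the augmented test.
ranked = sorted(
    candidates.items(),
    key=lambda kv: item_information_2pl(theta, *kv[1]).sum(),
    reverse=True,
)
print([name for name, _ in ranked])
```

On the "equivalent items" framing in the abstract, one natural reading (offered here as an assumption, since the abstract does not define the metric) is that the information gained from the LLM items is divided by the mean per-item information of the original 19-item test, expressing the gain in units of average original items.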