As researchers increasingly rely on machine learning models and large language models (LLMs) to annotate unstructured data, such as texts or images, various approaches have been proposed to correct bias in downstream statistical analysis. However, existing methods tend to yield large standard errors and require some error-free human annotations. In this paper, I introduce Surrogate Representation Inference (SRI), which assumes that unstructured data fully mediate the relationship between human annotations and structured variables. This assumption holds by design provided that human coders rely only on the unstructured data when annotating. Under this setting, I propose a neural network architecture that learns a low-dimensional representation of the unstructured data such that the surrogate assumption continues to hold. When multiple human annotations are available, SRI can be extended to also correct non-differential measurement errors in the human annotations. Focusing on text-as-outcome settings, I formally establish the identification conditions and semiparametrically efficient estimation strategies that enable learning and leveraging such a low-dimensional representation. Simulation studies and a real-world application demonstrate that SRI reduces standard errors by over 50% when machine learning classification accuracy is moderate, and that it provides valid inference even when human annotations contain non-differential measurement errors.