A growing literature on human-AI decision-making investigates strategies for combining human judgment with statistical models to improve decisions. Research in this area often evaluates proposed improvements to models, interfaces, or workflows by demonstrating improved predictive performance on "ground truth" labels. However, this practice overlooks a key difference between human judgments and model predictions. Whereas humans reason about broader phenomena of interest in a decision, including latent constructs that are not directly observable (such as disease status, the "toxicity" of online comments, or future "job performance"), predictive models target proxy labels that are readily available in existing datasets. Predictive models' reliance on simplistic proxies makes them vulnerable to various sources of statistical bias. In this paper, we identify five sources of target variable bias that can affect the validity of proxy labels in human-AI decision-making tasks. We develop a causal framework to disentangle the relationships among these biases and to clarify which are of concern in specific human-AI decision-making tasks. We demonstrate how our framework can be used to articulate implicit assumptions made in prior modeling work, and we recommend evaluation strategies for verifying whether these assumptions hold in practice. We then leverage our framework to re-examine the designs of prior human subjects experiments that investigate human-AI decision-making, finding that only a small fraction of studies examine factors related to target variable bias. We conclude by discussing opportunities to better address target variable bias in future research.