Across a wide variety of domains, there exists a performance gap between machine learning models' accuracy on dataset benchmarks and on real-world production data. Despite the careful design of static dataset benchmarks to represent the real world, models often err when the data is out-of-distribution relative to the data on which the models were trained. We can directly measure and adjust for some aspects of distribution shift, but we cannot address sample selection bias, adversarial perturbations, and non-stationarity without knowing the data generation process. In this paper, we outline two methods for identifying changes in context that lead to distribution shifts and model prediction errors: leveraging human intuition and expert knowledge to identify first-order contexts, and developing dynamic benchmarks based on desiderata for the data generation process. Furthermore, we present two case studies that highlight the implicit assumptions underlying applied machine learning models, assumptions that tend to cause errors when models attempt to generalize beyond test benchmark datasets. By paying close attention to the role of context in each prediction task, researchers can reduce context shift errors and improve generalization performance.