Machine learning models are often brittle on production data despite achieving high accuracy on benchmark datasets. Benchmark datasets have traditionally served two purposes: they offer a standard against which machine learning researchers can compare methods, and they provide a model, albeit imperfect, of the real world. The incompleteness of test benchmarks (and of the data on which models are trained) hinders robustness in machine learning, enables shortcut learning, and leaves models systematically prone to errors on out-of-distribution and adversarially perturbed data. The mismatch between a single static benchmark dataset and a production dataset has traditionally been described as dataset shift. To clarify how to address the mismatch between test benchmarks and production data, we introduce context shift to describe semantically meaningful changes in the underlying data generation process. Moreover, we identify three methods for addressing context shift that would otherwise lead to model prediction errors. First, we describe how human intuition and expert knowledge can identify semantically meaningful features on which models systematically fail. Second, we detail how dynamic benchmarking, with its focus on capturing the data generation process, can promote generalizability through corroboration. Third, we highlight that clarifying a model's limitations can reduce unexpected errors. Because robust machine learning is concerned with model performance beyond benchmarks, we consider three model organism domains (facial expression recognition, deepfake detection, and medical diagnosis) to highlight how implicit assumptions in benchmark tasks lead to errors in practice. By paying close attention to the role of context, researchers can design more comprehensive benchmarks, reduce context shift errors, and increase generalizability.