Statistical tests for dataset shift are susceptible to false alarms: they are sensitive to minor differences even when sample coverage and predictive performance are in fact adequate. We propose instead a robust framework for tests of dataset shift based on outlier scores, D-SOS for short. D-SOS detects adverse shifts and can identify false alarms caused by benign ones. It posits that a new (test) sample is not substantively worse than an old (training) sample, and not that the two are equal. The key idea is to reduce observations to outlier scores and compare contamination rates. Beyond comparing distributions, users can define what 'worse' means in terms of predictive performance and other relevant notions. We show how versatile and practical D-SOS is on a wide range of real and simulated datasets. Unlike tests of equal distribution and of goodness-of-fit, the D-SOS tests are uniquely tailored to serve as robust performance metrics for monitoring model drift and dataset shift.
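To make the key idea concrete, the following is a minimal Python sketch, not the authors' implementation: it reduces observations to outlier scores with an off-the-shelf detector fit on the training sample, then runs a one-sided test of the null that the new sample is no worse than the old one. The detector (scikit-learn's IsolationForest) and the one-sided rank test are stand-ins chosen for illustration; the D-SOS statistic itself compares contamination rates rather than using a plain rank test.

```python
# Illustrative sketch of the D-SOS idea (assumed components, not the paper's method):
# 1) reduce observations to outlier scores, 2) test for an adverse shift toward
# higher (worse) scores in the new sample relative to the old sample.
import numpy as np
from sklearn.ensemble import IsolationForest
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 5))  # old (training) sample
X_test = rng.normal(0.3, 1.3, size=(1000, 5))   # new (test) sample, mildly shifted

# Step 1: reduce observations to outlier scores (higher = more outlying).
detector = IsolationForest(random_state=0).fit(X_train)
scores_train = -detector.score_samples(X_train)
scores_test = -detector.score_samples(X_test)

# Step 2: one-sided test of the null "the new sample is not worse than the old one".
# A small p-value suggests an adverse shift toward more outlying observations.
_, p_value = mannwhitneyu(scores_test, scores_train, alternative="greater")
print(f"one-sided p-value for adverse shift: {p_value:.4f}")
```

Because the null is one-sided, benign differences that do not push the test sample toward higher outlier scores do not trigger a rejection, which is the behavior the abstract describes for avoiding false alarms.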