When deployed in the real world, machine learning models inevitably encounter changes in the data distribution, and certain -- but not all -- distribution shifts can result in significant performance degradation. In practice, it may make sense to ignore benign shifts, under which the performance of a deployed model does not degrade substantially, making interventions by a human expert (or model retraining) unnecessary. While several works have developed tests for distribution shifts, these typically either use non-sequential methods, or detect arbitrary shifts (benign or harmful), or both. We argue that a sensible method for firing off a warning has to both (a) detect harmful shifts while ignoring benign ones, and (b) allow continuous monitoring of model performance without increasing the false alarm rate. In this work, we design simple sequential tools for testing whether the difference between source (training) and target (test) distributions leads to a significant drop in a risk function of interest, such as accuracy or calibration. Recent advances in constructing time-uniform confidence sequences allow efficient aggregation of the statistical evidence accumulated during the tracking process. The designed framework is applicable in settings where (some) true labels are revealed after predictions are made, or when batches of labels become available in a delayed fashion. We demonstrate the efficacy of the proposed framework through an extensive empirical study on a collection of simulated and real datasets.
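To make the monitoring idea concrete, the following is a minimal sketch (not the paper's exact construction) of sequentially tracking a bounded loss (e.g., 0/1 error) on the target stream with a time-uniform, Hoeffding-style lower confidence bound, firing a warning only when the evidence indicates a harmful drop. The names `source_risk`, `eps_tol`, and `lam`, and the fixed-parameter confidence sequence itself, are illustrative assumptions; the paper's confidence sequences are tighter.

```python
# Minimal sketch of sequential risk monitoring with a time-uniform confidence bound.
# Assumptions: i.i.d. losses bounded in [0, 1]; a fixed tuning parameter lam > 0.
import math

def monitor(target_losses, source_risk, eps_tol=0.05, alpha=0.05, lam=1.0):
    """Return the first time t at which the time-uniform lower confidence bound
    on the target risk exceeds source_risk + eps_tol (a harmful shift warning),
    or None if that never happens on the observed stream.

    With losses X_i in [0, 1] and mean mu, exp(lam * sum(X_i - mu) - t * lam^2 / 8)
    is a nonnegative supermartingale, so by Ville's inequality the bound below
    holds simultaneously for all t with probability at least 1 - alpha. Its width
    does not shrink to zero as t grows; this is the price of the simple fixed-lam
    construction used here for illustration.
    """
    running_sum, t = 0.0, 0
    for loss in target_losses:
        t += 1
        running_sum += loss
        mean = running_sum / t
        # Time-uniform lower confidence bound on the target risk.
        lcb = mean - lam / 8.0 - math.log(1.0 / alpha) / (lam * t)
        if lcb > source_risk + eps_tol:
            return t  # fire a warning: evidence of a harmful shift at time t
    return None  # no evidence of a harmful shift; benign shifts are ignored
```

Because the confidence bound is valid uniformly over time, the stream can be checked after every revealed label (or every delayed batch) without inflating the false alarm rate, which is the property the continuous-monitoring requirement (b) above asks for.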