Automatically detecting data drift in machine learning classifiers

Classifiers and other statistics-based machine learning (ML) techniques generalize, or learn, from statistical properties of the training data. The theoretical and empirical performance guarantees of statistical ML rest on the assumption that the training data distribution is representative of the production data distribution. This assumption often breaks; for instance, the statistical distributions of the data may change. We term changes that affect ML performance "data drift" or simply "drift". Many classification techniques compute a measure of confidence in their results, but this measure might not reflect actual ML performance. A famous example is the picture of a panda that is correctly classified with a confidence of about 60%, but that, once noise is added, is incorrectly classified as a gibbon with a confidence above 99%. Nevertheless, the work we report on here suggests that a classifier's measure of confidence can be used to detect data drift. We propose an approach, based solely on the labels a classifier suggests and its confidence in them, for alerting on changes in the data distribution or feature space that are likely to cause data drift. Our approach identifies degradation in model performance and does not require labeled production data, which is often lacking or delayed. Our experiments with three different data sets and classifiers demonstrate the effectiveness of this approach in detecting data drift. This is especially encouraging because the classification itself may or may not be correct, and no model input data is required. We further explore sequential change-point tests, a statistical approach, to automatically determine the amount of data needed to identify drift while controlling the false positive rate (Type-1 error).
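
The idea of watching only the classifier's confidence values for a sustained shift can be sketched roughly as follows. This is a minimal illustration using a one-sided CUSUM detector, which is a generic sequential change-detection scheme swapped in for illustration and not necessarily the change-point test used in this work. The function name `cusum_drift_index` and the `slack` and `threshold` defaults are hypothetical; in practice the threshold would have to be calibrated to reach a desired false positive rate.

```python
from statistics import mean, stdev


def cusum_drift_index(baseline, stream, slack=0.5, threshold=8.0):
    """Return the first index in `stream` where a confidence drop is flagged.

    baseline:  confidences collected on data assumed to match training data.
    stream:    production confidences, in arrival order.
    slack, threshold: hypothetical tuning values, in units of the baseline
                      standard deviation. A larger threshold means fewer
                      false alarms but slower detection.
    Returns None if no drift is flagged.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    score = 0.0
    for i, conf in enumerate(stream):
        # Standardized shortfall: positive when confidence is below baseline.
        z = (mu - conf) / sigma
        # Accumulate evidence of a downward shift; reset at zero otherwise.
        score = max(0.0, score + z - slack)
        if score > threshold:
            return i
    return None


# Illustrative, synthetic confidences (not data from the paper).
baseline = [0.8, 0.9] * 50
stream = [0.8, 0.9] * 50 + [0.5] * 20   # confidence drops at index 100
print(cusum_drift_index(baseline, stream))
```

Note that the detector never looks at the model inputs or at true labels, only at the stream of confidence values, which mirrors the setting described above. A CUSUM threshold does not by itself guarantee a specific Type-1 error rate; it only trades detection delay against false alarms.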
