Machine learning (ML) models serve critical functions, such as classifying loan applicants as good or bad risks. Each model is trained under the assumption that the data used in training and the data encountered in the field come from the same underlying, unknown distribution. In practice, this assumption is often violated. It is desirable to identify when this occurs in order to minimize the impact on model performance. We propose a new approach to detecting change in the data distribution by identifying polynomial relations between the data features. We measure the strength of each identified relation using its R-squared value. A strong polynomial relation captures a significant trait of the data, one that should remain stable if the data distribution does not change. We therefore use a set of learned strong polynomial relations to identify drift. For each polynomial relation stronger than a given threshold, we calculate the amount of drift observed for that relation. The amount of drift is estimated by calculating the Bayes factor for the likelihood of the polynomial relation on the baseline data versus the field data. We empirically validate the approach by simulating a range of changes in three publicly available data sets, and demonstrate the ability to identify drift using the Bayes factor of the change in polynomial relation likelihood.
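The idea of scoring drift through a learned polynomial relation can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the single-predictor degree-2 relation, the Gaussian residual model, and the synthetic data are all assumptions made for illustration. In particular, the score below is a plug-in log-likelihood ratio on the field data (relation refit on field data versus relation learned on baseline data), used here as a crude stand-in for a Bayes factor.

```python
import numpy as np

def fit_relation(x, y, degree=2):
    """Least-squares polynomial fit of y on x; returns coefficients,
    the maximum-likelihood residual std, and the R-squared value."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    sigma = np.sqrt(ss_res / len(y))
    return coeffs, sigma, 1.0 - ss_res / ss_tot

def gaussian_loglik(x, y, coeffs, sigma):
    """Log-likelihood of (x, y) under y = poly(x) + N(0, sigma^2)."""
    resid = y - np.polyval(coeffs, x)
    n = len(y)
    return float(-0.5 * n * np.log(2.0 * np.pi * sigma ** 2)
                 - np.sum(resid ** 2) / (2.0 * sigma ** 2))

def drift_score(x_base, y_base, x_field, y_field, degree=2):
    """Assumed stand-in for the Bayes factor: log-likelihood of the field
    data under a relation refit on the field data, minus its log-likelihood
    under the relation learned on the baseline. Larger means more drift."""
    c_base, s_base, _ = fit_relation(x_base, y_base, degree)
    c_field, s_field, _ = fit_relation(x_field, y_field, degree)
    return (gaussian_loglik(x_field, y_field, c_field, s_field)
            - gaussian_loglik(x_field, y_field, c_base, s_base))

# Synthetic illustration (hypothetical data, not from the paper).
rng = np.random.default_rng(0)

def sample(n, linear_term):
    x = rng.uniform(-2.0, 2.0, n)
    y = 2.0 * x ** 2 + linear_term * x + 1.0 + rng.normal(0.0, 0.3, n)
    return x, y

x_base, y_base = sample(500, 0.0)    # baseline data
x_same, y_same = sample(500, 0.0)    # field data, same distribution
x_drift, y_drift = sample(500, 3.0)  # field data with a changed relation

_, _, r2_base = fit_relation(x_base, y_base)
score_same = drift_score(x_base, y_base, x_same, y_same)
score_drift = drift_score(x_base, y_base, x_drift, y_drift)
print(f"baseline R^2: {r2_base:.3f}")
print(f"score without drift: {score_same:.2f}, with drift: {score_drift:.2f}")
```

In this sketch, a relation would only be monitored if its baseline R-squared exceeds a chosen threshold, and the score is expected to stay small when the field data follows the baseline relation and to grow when the relation changes. A full Bayes factor would integrate over the model parameters rather than plug in point estimates.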