Machine learning models serve critical functions, such as classifying loan applicants as good or bad risks. Each model is trained under the assumption that the data used in training and the data encountered in the field come from the same underlying unknown distribution. In practice, this assumption is often violated. Identifying when this occurs is desirable, because it allows the impact on model performance to be minimized. We suggest a new approach to detecting change in the data distribution by identifying polynomial relations between the data features. We measure the strength of each identified relation using its R-squared value. A strong polynomial relation captures a significant trait of the data, which should remain stable if the data distribution does not change. We thus use a set of learned strong polynomial relations to identify drift. For each polynomial relation that is stronger than a given threshold, we calculate the amount of drift observed for that relation. The amount of drift is measured by calculating the Bayes factor for the polynomial relation's likelihood under the baseline data versus the field data. We empirically validate the approach by simulating a range of changes and identifying drift using the Bayes factor of the change in the polynomial relation's likelihood.
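The pipeline described above (fit a polynomial relation, keep it only if its R-squared exceeds a threshold, then compare likelihoods on baseline and field data) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the relation is restricted to one feature expressed as a polynomial in one other feature, the residuals are modeled as Gaussian, the function names and the default threshold of 0.9 are hypothetical, and the Bayes factor is approximated by a plain log-likelihood ratio using point estimates rather than marginal likelihoods.

```python
import numpy as np


def fit_relation(x, y, degree=2):
    """Least-squares fit of y as a polynomial in x.

    Returns the coefficients, the maximum-likelihood residual standard
    deviation, and the R-squared value of the fit.
    """
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    sigma = np.sqrt(ss_res / len(y))
    return coeffs, sigma, r2


def gaussian_loglik(x, y, coeffs, sigma):
    """Log-likelihood of (x, y) under y = poly(x) + N(0, sigma^2) noise."""
    resid = y - np.polyval(coeffs, x)
    n = len(y)
    return float(
        -0.5 * n * np.log(2.0 * np.pi * sigma ** 2)
        - 0.5 * np.sum(resid ** 2) / sigma ** 2
    )


def log_drift_score(x_base, y_base, x_field, y_field,
                    degree=2, r2_threshold=0.9):
    """Approximate log Bayes factor for drift in one polynomial relation.

    Assumption: the Bayes factor is replaced by a log-likelihood ratio on
    the field data, comparing a relation refit on the field data against
    the relation learned on the baseline data. Returns None when the
    baseline relation is weaker than the R-squared threshold.
    """
    coeffs_b, sigma_b, r2 = fit_relation(x_base, y_base, degree)
    if r2 < r2_threshold:
        return None  # relation too weak to be used as a drift indicator
    coeffs_f, sigma_f, _ = fit_relation(x_field, y_field, degree)
    ll_refit = gaussian_loglik(x_field, y_field, coeffs_f, sigma_f)
    ll_baseline = gaussian_loglik(x_field, y_field, coeffs_b, sigma_b)
    return ll_refit - ll_baseline
```

Under these assumptions, a score near zero suggests that the baseline relation still explains the field data about as well as a freshly fitted one, while a large score suggests the relation has changed. Because the refit model is chosen to maximize the likelihood on the field data, the score is never meaningfully negative, so any decision cutoff would need to be calibrated, for example on held-out baseline data.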