Most machine learning (ML) methods assume that the data used in the training phase come from the target population. In practice, however, one often faces dataset shift, which, if not properly taken into account, may degrade the predictive performance of ML models. In general, if the practitioner knows which type of shift is taking place -- e.g., covariate shift or label shift -- they may apply transfer learning methods to obtain better predictions. Unfortunately, current methods for detecting shift are designed to detect only specific types of shift or cannot formally test for their presence. We introduce a general and unified framework that gives insights on how to improve prediction methods by detecting the presence of different types of shift and quantifying how strong they are. Our approach can be used with any data type (tabular/image/text) and for both classification and regression tasks. Moreover, it uses formal hypothesis tests that control false alarms. We illustrate how our framework is useful in practice using both artificial and real datasets, including an example of how our framework leads to insights that indeed improve the predictive power of a supervised model. Our package for dataset shift detection can be found at https://github.com/felipemaiapolo/detectshift.
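To make the idea of formally testing for shift concrete, the following sketch shows one standard way to test whether the covariate distribution differs between training (source) and deployment (target) data: a permutation two-sample test. This is an illustrative example only, not the detectshift API; the statistic (difference of sample means) and all function names are our own choices, and the abstract's framework covers more general types of shift and statistics.

```python
import numpy as np

def permutation_shift_test(X_source, X_target, n_perm=1000, seed=0):
    """Permutation two-sample test for covariate shift.

    Uses the Euclidean distance between sample means as the test
    statistic. Under the null hypothesis (no shift), source and target
    samples are exchangeable, so permuting the pooled sample yields a
    valid p-value and controls the false-alarm rate at the chosen level.
    """
    rng = np.random.default_rng(seed)
    observed = np.linalg.norm(X_source.mean(axis=0) - X_target.mean(axis=0))
    pooled = np.vstack([X_source, X_target])
    n = len(X_source)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        a, b = pooled[perm[:n]], pooled[perm[n:]]
        if np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)) >= observed:
            exceed += 1
    # Add-one correction gives a valid (slightly conservative) p-value.
    return (exceed + 1) / (n_perm + 1)

# Example: identical distributions vs. a mean-shifted target.
rng = np.random.default_rng(1)
Xs = rng.normal(size=(200, 3))
Xt_same = rng.normal(size=(200, 3))          # no shift
Xt_shifted = rng.normal(loc=1.0, size=(200, 3))  # covariate shift
print(permutation_shift_test(Xs, Xt_same))     # typically a large p-value
print(permutation_shift_test(Xs, Xt_shifted))  # small p-value: shift detected
```

A mean-based statistic only detects location shifts; in practice one would use a richer statistic (e.g., based on a classifier distinguishing source from target), but the permutation logic for controlling false alarms is the same.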