Most machine learning (ML) methods assume that the data used in the training phase comes from the distribution of the target population. In practice, however, one often faces dataset shift, which, if not properly taken into account, may degrade the predictive performance of ML models. In general, if the practitioner knows which type of shift is taking place - e.g., covariate shift or label shift - they may apply transfer learning methods to obtain better predictions. Unfortunately, current methods for detecting shift are designed to detect only specific types of shift or cannot formally test for their presence. We introduce a general framework that gives insight into how to improve prediction methods by detecting the presence of different types of shift and quantifying how strong they are. Our approach can be used for any data type (tabular/image/text) and for both classification and regression tasks. Moreover, it uses formal hypothesis tests that control false alarms. We illustrate how our framework is useful in practice using both artificial and real datasets. Our package for dataset shift detection is available at https://github.com/felipemaiapolo/detectshift.