Most machine learning (ML) methods assume that the data used in the training phase come from the target population. In practice, however, one often faces dataset shift, which, if not properly taken into account, may degrade the predictive performance of ML models. In general, if the practitioner knows which type of shift is taking place -- e.g., covariate shift or label shift -- they may apply transfer learning methods to obtain better predictions. Unfortunately, current methods for detecting shift are designed to detect only specific types of shift or cannot formally test for their presence. We introduce a general and unified framework that gives insights on how to improve prediction methods by detecting the presence of different types of shift and quantifying how strong they are. Our approach can be used with any data type (tabular/image/text) and for both classification and regression tasks. Moreover, it uses formal hypothesis tests that control false alarms. We illustrate how our framework is useful in practice using both artificial and real datasets, including an example of how our framework leads to insights that indeed improve the predictive power of a supervised model. Our package for dataset shift detection can be found at https://github.com/felipemaiapolo/detectshift.
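To make the idea of formally testing for shift concrete, the following sketch shows one standard way to test whether the covariate distribution differs between training (source) and deployment (target) data: a permutation two-sample test. This is an illustrative example only, not the detectshift API; the statistic (difference of sample means) and all function names are our own choices, and the abstract's framework covers more general types of shift and statistics.

```python
import numpy as np

def permutation_shift_test(X_source, X_target, n_perm=1000, seed=0):
    """Permutation two-sample test for covariate shift.

    Uses the Euclidean distance between sample means as the test
    statistic. Under the null hypothesis (no shift), source and target
    samples are exchangeable, so permuting the pooled sample yields a
    valid p-value and controls the false-alarm rate at the chosen level.
    """
    rng = np.random.default_rng(seed)
    observed = np.linalg.norm(X_source.mean(axis=0) - X_target.mean(axis=0))
    pooled = np.vstack([X_source, X_target])
    n = len(X_source)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        a, b = pooled[perm[:n]], pooled[perm[n:]]
        if np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)) >= observed:
            exceed += 1
    # Add-one correction gives a valid (slightly conservative) p-value.
    return (exceed + 1) / (n_perm + 1)

# Example: identical distributions vs. a mean-shifted target.
rng = np.random.default_rng(1)
Xs = rng.normal(size=(200, 3))
Xt_same = rng.normal(size=(200, 3))          # no shift
Xt_shifted = rng.normal(loc=1.0, size=(200, 3))  # covariate shift
print(permutation_shift_test(Xs, Xt_same))     # typically a large p-value
print(permutation_shift_test(Xs, Xt_shifted))  # small p-value: shift detected
```

A mean-based statistic only detects location shifts; in practice one would use a richer statistic (e.g., based on a classifier distinguishing source from target), but the permutation logic for controlling false alarms is the same.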