Most machine learning (ML) methods assume that the data used in the training phase comes from the distribution of the target population. In practice, however, one often faces dataset shift, which, if not properly taken into account, may degrade the predictive performance of ML models. In general, if the practitioner knows which type of shift is taking place - e.g., covariate shift or label shift - they may apply transfer learning methods to obtain better predictions. Unfortunately, current methods for detecting shift are designed to detect only specific types of shift or cannot formally test for their presence. We introduce a general framework that gives insight into how to improve prediction methods by detecting the presence of different types of shift and quantifying how strong they are. Our approach can be used for any data type (tabular/image/text) and for both classification and regression tasks. Moreover, it uses formal hypothesis tests that control false alarms. We illustrate how our framework is useful in practice using both artificial and real datasets. Our package for dataset shift detection is available at https://github.com/felipemaiapolo/detectshift.