Understanding the performance of machine learning models across diverse data distributions is critically important for reliable applications. Motivated by this, there is a growing focus on curating benchmark datasets that capture distribution shifts. While valuable, the existing benchmarks are limited in that many of them only contain a small number of shifts and lack systematic annotation of what differs across shifts. We present MetaShift--a collection of 12,868 sets of natural images across 410 classes--to address this challenge. We leverage the natural heterogeneity of Visual Genome and its annotations to construct MetaShift. The key construction idea is to cluster images using their metadata, which provides context for each image (e.g., "cats with cars" or "cats in bathrooms"); each context represents a distinct data distribution. MetaShift has two important benefits: first, it contains orders of magnitude more natural data shifts than previously available; second, it provides explicit explanations of what is unique about each of its datasets and a distance score that measures the amount of distribution shift between any two of them. We demonstrate the utility of MetaShift by benchmarking several recent proposals for training models to be robust to data shifts. We find that simple empirical risk minimization performs best when shifts are moderate and that no method has a systematic advantage under large shifts. We also show how MetaShift can help to visualize conflicts between data subsets during model training.
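To make the construction idea concrete, the following is a minimal sketch of grouping images into context-defined subsets from object annotations, using toy data in the spirit of Visual Genome metadata. The annotations, subset names, and Jaccard-based distance here are illustrative assumptions for exposition, not the actual MetaShift pipeline or its distance score.

```python
from collections import defaultdict

# Toy annotations: each image lists the objects it contains.
# (Hypothetical data, not drawn from Visual Genome itself.)
annotations = {
    "img1": {"cat", "car", "street"},
    "img2": {"cat", "sink", "bathroom"},
    "img3": {"cat", "car", "road"},
    "img4": {"dog", "car", "street"},
    "img5": {"cat", "bathroom", "towel"},
}

def build_subsets(annotations, target_class):
    """Group images of `target_class` by each co-occurring context object,
    yielding subsets such as 'cat(car)' or 'cat(bathroom)'."""
    subsets = defaultdict(set)
    for img, objects in annotations.items():
        if target_class in objects:
            for ctx in objects - {target_class}:
                subsets[f"{target_class}({ctx})"].add(img)
    return dict(subsets)

def jaccard_distance(a, b):
    """A simple proxy for how much two image subsets differ: 1 - overlap."""
    return 1.0 - len(a & b) / len(a | b)

cat_subsets = build_subsets(annotations, "cat")
# 'cat(car)' = {img1, img3}; 'cat(bathroom)' = {img2, img5}: disjoint subsets.
d = jaccard_distance(cat_subsets["cat(car)"], cat_subsets["cat(bathroom)"])
```

In this toy example the two subsets share no images, so the distance is maximal (1.0); in practice a distribution-shift score would compare the subsets' feature statistics rather than raw image overlap.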