A Bag-of-Prototypes Representation for Dataset-Level Applications

This work investigates dataset vectorization for two dataset-level tasks: assessing training set suitability and test set difficulty. The former measures how suitable a training set is for a target domain, while the latter measures how challenging a test set is for a trained model. Central to both tasks is measuring the underlying relationship between datasets. This calls for a dataset vectorization scheme that preserves as much discriminative dataset information as possible, so that the distance between the resulting dataset vectors reflects dataset-to-dataset similarity. To this end, we propose a bag-of-prototypes (BoP) dataset representation that extends the image-level bag of patch descriptors to a dataset-level bag of semantic prototypes. Specifically, we build a codebook of K prototypes clustered from a reference dataset. Given a dataset to be encoded, we quantize each of its image features to a prototype in the codebook and obtain a K-dimensional histogram. Without assuming access to dataset labels, the BoP representation provides a rich characterization of the dataset's semantic distribution. Furthermore, BoP representations work well with the Jensen-Shannon divergence for measuring dataset-to-dataset similarity. Although very simple, BoP consistently shows its advantage over existing representations on a series of benchmarks for the two dataset-level tasks.
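The pipeline described in the abstract (codebook, quantization, histogram, divergence) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract only says the prototypes are "clustered", so plain k-means is an assumption here. Image features are assumed to be already extracted as equal-length lists of floats, and every function name (`build_codebook`, `encode_bop`, `js_divergence`) is hypothetical.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_prototype(feature, codebook):
    """Index of the codebook prototype closest to the given feature."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(feature, codebook[k]))


def build_codebook(reference_features, k, iterations=20, seed=0):
    """Cluster reference features into k prototypes.

    Assumption: plain k-means (Lloyd's algorithm); the abstract does not
    name the clustering method.
    """
    rng = random.Random(seed)
    codebook = [list(f) for f in rng.sample(reference_features, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in reference_features:
            clusters[nearest_prototype(f, codebook)].append(f)
        for idx, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous prototype
                codebook[idx] = [sum(col) / len(members)
                                 for col in zip(*members)]
    return codebook


def encode_bop(features, codebook):
    """Quantize each feature to a prototype; return a normalized K-dim histogram."""
    if not features:
        raise ValueError("cannot encode an empty dataset")
    hist = [0.0] * len(codebook)
    for f in features:
        hist[nearest_prototype(f, codebook)] += 1.0
    return [h / len(features) for h in hist]


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so the value lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        # Terms with u_i == 0 contribute nothing; v_i > 0 whenever u_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

To use it, build one codebook from the reference dataset, encode each dataset of interest with `encode_bop` against that same codebook, and compare the resulting histograms with `js_divergence`. A smaller value indicates more similar prototype distributions. No labels are needed at any step, which matches the label-free setting described above.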

