This work investigates dataset vectorization for two dataset-level tasks: assessing training set suitability and test set difficulty. The former measures how suitable a training set is for a target domain, while the latter measures how challenging a test set is for a trained model. Central to both tasks is measuring the underlying relationship between datasets. This calls for a dataset vectorization scheme that preserves as much discriminative dataset information as possible, so that the distance between the resulting dataset vectors reflects dataset-to-dataset similarity. To this end, we propose a bag-of-prototypes (BoP) dataset representation that extends the image-level bag of patch descriptors to a dataset-level bag of semantic prototypes. Specifically, we build a codebook of K prototypes clustered from a reference dataset. Given a dataset to be encoded, we quantize each of its image features to a prototype in the codebook and obtain a K-dimensional histogram. Without assuming access to dataset labels, the BoP representation provides a rich characterization of the dataset's semantic distribution. Furthermore, BoP representations work well with the Jensen-Shannon divergence for measuring dataset-to-dataset similarity. Despite its simplicity, BoP consistently shows an advantage over existing representations on a series of benchmarks for the two dataset-level tasks.
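The encoding pipeline described in the abstract (codebook, quantization, histogram, Jensen-Shannon divergence) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract says only that prototypes are "clustered" and that each feature is quantized to a prototype, so the use of plain k-means and Euclidean nearest-prototype assignment here is an assumption. The function names (`build_codebook`, `quantize`, `bop_histogram`, `js_divergence`) are hypothetical, and random vectors stand in for image features that would normally come from a pretrained feature extractor.

```python
import numpy as np


def quantize(features, prototypes):
    """Return, for each feature, the index of its nearest prototype (Euclidean)."""
    # Squared distances between every feature and every prototype: shape (N, K).
    d2 = ((features[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def build_codebook(ref_features, k, n_iter=20, seed=0):
    """Cluster reference features into k prototypes with plain k-means (assumed)."""
    rng = np.random.default_rng(seed)
    # Initialise the prototypes from k distinct reference features.
    idx = rng.choice(len(ref_features), size=k, replace=False)
    prototypes = ref_features[idx].astype(float)
    for _ in range(n_iter):
        assign = quantize(ref_features, prototypes)
        for j in range(k):
            members = ref_features[assign == j]
            if len(members) > 0:  # keep the old prototype if a cluster empties
                prototypes[j] = members.mean(axis=0)
    return prototypes


def bop_histogram(features, prototypes):
    """Encode a dataset as a normalised K-dimensional histogram over prototypes."""
    counts = np.bincount(quantize(features, prototypes), minlength=len(prototypes))
    return counts / counts.sum()


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Synthetic stand-ins for image features (assumption: 8-dimensional vectors).
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))        # reference dataset
near = rng.normal(size=(300, 8))             # same distribution as the reference
far = rng.normal(loc=3.0, size=(300, 8))     # shifted distribution

codebook = build_codebook(reference, k=16)
h_ref = bop_histogram(reference, codebook)
print("JS(reference, near):", js_divergence(h_ref, bop_histogram(near, codebook)))
print("JS(reference, far): ", js_divergence(h_ref, bop_histogram(far, codebook)))
```

No labels are used at any step, matching the abstract's claim that BoP does not assume access to dataset labels. In this toy setup the shifted dataset should yield a larger divergence from the reference than the dataset drawn from the same distribution.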