Coreset selection, which aims to select a subset of the most informative training samples, is a long-standing learning problem that can benefit many downstream tasks such as data-efficient learning, continual learning, neural architecture search, and active learning. However, many existing coreset selection methods were not designed for deep learning and may suffer from high computational complexity and poor generalization performance. In addition, recently proposed methods are evaluated on models, datasets, and settings of varying complexity, which makes fair comparison difficult. To advance research on coreset selection in deep learning, we contribute a comprehensive code library, namely DeepCore, and provide an empirical study of popular coreset selection methods on the CIFAR10 and ImageNet datasets. Extensive experiments on CIFAR10 and ImageNet verify that, although various methods have advantages in certain experimental settings, random selection remains a strong baseline.
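The random-selection baseline referenced above amounts to sampling a fixed fraction of training indices uniformly without replacement. A minimal sketch (the function name and signature are illustrative, not part of DeepCore's API):

```python
import random

def random_coreset(indices, fraction, seed=0):
    """Select a random coreset: a uniform sample (without replacement)
    of the given training-sample indices."""
    rng = random.Random(seed)  # fixed seed for reproducible subsets
    pool = list(indices)
    k = max(1, int(len(pool) * fraction))
    return rng.sample(pool, k)

# Example: keep 10% of a CIFAR10-sized index set (50,000 samples).
subset = random_coreset(range(50_000), fraction=0.1)
```

Despite its simplicity, this baseline is competitive in many settings, so any learned or geometry-based selection method should be compared against it under identical budgets and training configurations.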