Dataset Condensation is a newly emerging technique that aims to learn a tiny dataset capturing the rich information encoded in the original dataset. As the datasets contemporary machine learning models rely on become increasingly large, condensation methods have become a prominent direction for accelerating network training and reducing data storage. Although numerous methods have been proposed in this rapidly growing field, evaluating and comparing different condensation methods is non-trivial and remains an open issue. The quality of a condensed dataset is often obscured by other critical factors that contribute to end performance, such as data augmentation and model architectures. The lack of a systematic way to evaluate and compare condensation methods not only hinders our understanding of existing techniques but also discourages practical use of the synthesized datasets. This work provides the first large-scale standardized benchmark for Dataset Condensation. It consists of a suite of evaluations that comprehensively reflect the generalizability and effectiveness of condensation methods through the lens of their generated datasets. Leveraging this benchmark, we conduct a large-scale study of current condensation methods and report many insightful findings that open up new possibilities for future development. The benchmark library, including evaluators, baseline methods, and generated datasets, is open-sourced to facilitate future research and application.