Dataset Condensation is a newly emerging technique that aims to learn a tiny dataset capturing the rich information encoded in the original dataset. As the datasets contemporary machine learning models rely on become increasingly large, condensation methods have become a prominent direction for accelerating network training and reducing data storage. Although numerous methods have been proposed in this rapidly growing field, evaluating and comparing different condensation methods is non-trivial and remains an open issue. The quality of a condensed dataset is often obscured by other critical factors that contribute to end performance, such as data augmentation and model architectures. The lack of a systematic way to evaluate and compare condensation methods not only hinders our understanding of existing techniques but also discourages practical use of the synthesized datasets. This work provides the first large-scale standardized benchmark for Dataset Condensation. It consists of a suite of evaluations that comprehensively reflect the generalizability and effectiveness of condensation methods through the lens of their generated datasets. Leveraging this benchmark, we conduct a large-scale study of current condensation methods and report many insightful findings that open up new possibilities for future development. The benchmark library, including evaluators, baseline methods, and generated datasets, is open-sourced to facilitate future research and application.