The popularity of deep learning has led to the curation of a vast number of massive and multifarious datasets. Although such models achieve close-to-human performance on individual tasks, training these parameter-hungry models on large datasets poses multi-faceted problems, such as (a) high model-training time; (b) slow research iteration; and (c) poor eco-sustainability. As an alternative, data distillation approaches aim to synthesize terse data summaries, which can serve as effective drop-in replacements for the original dataset in scenarios such as model training, inference, and architecture search. In this survey, we present a formal framework for data distillation and provide a detailed taxonomy of existing approaches. Additionally, we cover data distillation approaches for different data modalities, namely images, graphs, and user-item interactions (recommender systems), while also identifying current challenges and future research directions.
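To make the notion of a "drop-in replacement" concrete, data distillation is commonly framed as a bilevel optimization problem. The sketch below is a generic formulation, not the specific framework developed in this survey; the symbols $\mathcal{D}$ (original dataset), $\mathcal{D}_{\text{syn}}$ (synthesized data summary), $f_\theta$ (a learner), and $\ell$ (a loss) are illustrative placeholders. The inner problem trains the model on the synthetic summary, while the outer problem requires the resulting parameters to perform well on the original data:

\[
\mathcal{D}_{\text{syn}}^{\star} \;=\; \arg\min_{\mathcal{D}_{\text{syn}}} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \ell\big(f_{\theta^{\star}(\mathcal{D}_{\text{syn}})}(x),\, y\big) \Big]
\quad \text{s.t.} \quad
\theta^{\star}(\mathcal{D}_{\text{syn}}) \;=\; \arg\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{syn}}} \Big[ \ell\big(f_{\theta}(x),\, y\big) \Big].
\]

Under this reading, a data summary is "effective" precisely when the outer objective is close to that obtained by training directly on $\mathcal{D}$, which is why the summary can stand in for the original data during training, inference, or architecture search.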