Deep learning has advanced at an unprecedented pace over the last decade and has become the primary choice in many application domains. This progress is mainly attributed to a virtuous cycle in which rapidly growing computing resources enable increasingly sophisticated algorithms to exploit massive data. However, handling the ever-growing volume of data with limited computing power has become increasingly challenging. To this end, diverse approaches have been proposed to improve data processing efficiency. Dataset distillation, a dataset reduction method, addresses this problem by synthesizing a small, representative dataset from substantial data, and has attracted much attention from the deep learning community. Existing dataset distillation methods can be taxonomized into meta-learning and data-matching frameworks according to whether they explicitly mimic the performance of the target data. Although dataset distillation has shown surprising performance in compressing datasets, several limitations remain, such as the difficulty of distilling high-resolution data. This paper provides a holistic understanding of dataset distillation from multiple aspects, including distillation frameworks and algorithms, factorized dataset distillation, performance comparison, and applications. Finally, we discuss challenges and promising directions to further promote future studies on dataset distillation.
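To make the meta-learning formulation concrete, the following is a toy sketch, not any specific published algorithm: a bilevel optimization in which a tiny synthetic set is tuned so that a model trained on it alone performs well on the real data. All names and the one-dimensional linear-regression setup are illustrative assumptions; for tractability the synthetic inputs are held fixed and only the synthetic labels are learned, with finite-difference gradients standing in for automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Real dataset: 200 noisy points from the line y = 2x + 1.
X = rng.uniform(-1, 1, 200)
Y = 2 * X + 1 + rng.normal(0, 0.1, 200)

# Synthetic "distilled" set: two points with fixed inputs and learnable labels
# (a deliberate simplification; full methods also optimize the inputs).
xs = np.array([-0.5, 0.5])

def inner_fit(ys):
    # Inner loop: train the model (here, closed-form least squares)
    # on the synthetic set alone.
    A = np.stack([xs, np.ones_like(xs)], axis=1)
    w, b = np.linalg.lstsq(A, ys, rcond=None)[0]
    return w, b

def outer_loss(ys):
    # Outer objective: how well the synthetically trained model fits real data.
    w, b = inner_fit(ys)
    return np.mean((w * X + b - Y) ** 2)

ys = np.zeros(2)      # initial synthetic labels
lr, eps = 0.5, 1e-5
for _ in range(200):  # outer loop: update the synthetic labels
    grad = np.array([
        (outer_loss(ys + eps * e) - outer_loss(ys - eps * e)) / (2 * eps)
        for e in np.eye(2)
    ])
    ys -= lr * grad

w, b = inner_fit(ys)
print(f"distilled labels {ys}, recovered line y = {w:.2f}x + {b:.2f}")
```

After the outer loop converges, training on just the two synthetic points recovers roughly the same line as training on all 200 real points, which is the essence of the compression that dataset distillation aims for.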