Dataset distillation aims to synthesize a compact subset of the original data, such that models trained on it achieve performance comparable to those trained on the original large dataset. Existing distribution-matching methods are confined to Euclidean space, so they capture only linear structures and overlook the intrinsic geometry of real data, e.g., curvature. However, high-dimensional data often lie on low-dimensional manifolds, suggesting that dataset distillation should align the manifold of the distilled data with that of the original data. In this work, we propose a geometry-aware distribution-matching framework, called \textbf{GeoDM}, which operates in the Cartesian product of Euclidean, hyperbolic, and spherical manifolds, capturing flat, hierarchical, and cyclical structures within a unified representation. To adapt to the underlying data geometry, we introduce learnable curvature and weight parameters for the three geometries. In addition, we design an optimal transport loss to enhance distribution fidelity. Our theoretical analysis shows that geometry-aware distribution matching in the product space yields a smaller generalization error bound than its Euclidean counterpart. Extensive experiments on standard benchmarks demonstrate that our algorithm outperforms state-of-the-art dataset distillation methods and remains effective across various distribution-matching strategies for single geometries.
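To make this concrete, the sketch below illustrates one way such a product-space matching objective could be instantiated; the exact formulation in GeoDM may differ, and the component weights $w_{\mathbb{E}}, w_{\mathbb{H}}, w_{\mathbb{S}}$ and curvatures $c_{\mathbb{H}}<0$, $c_{\mathbb{S}}>0$ stand in for the learnable geometry parameters mentioned above. A point factorizes as $x=(x_{\mathbb{E}}, x_{\mathbb{H}}, x_{\mathbb{S}})$ over the three components, and the squared product-space distance is a weighted sum of the per-geometry squared geodesic distances:
\[
d^{2}_{\mathcal{P}}(x, y) \;=\; w_{\mathbb{E}}\, \lVert x_{\mathbb{E}} - y_{\mathbb{E}} \rVert^{2}
\;+\; w_{\mathbb{H}}\, d^{2}_{\mathbb{H}_{c_{\mathbb{H}}}}(x_{\mathbb{H}}, y_{\mathbb{H}})
\;+\; w_{\mathbb{S}}\, d^{2}_{\mathbb{S}_{c_{\mathbb{S}}}}(x_{\mathbb{S}}, y_{\mathbb{S}}).
\]
An optimal transport loss between the empirical distributions $\mu$ (original data) and $\nu$ (distilled data) can then take $d^{2}_{\mathcal{P}}$ as the ground cost:
\[
\mathcal{L}_{\mathrm{OT}}(\mu, \nu) \;=\; \min_{\pi \in \Pi(\mu, \nu)} \int d^{2}_{\mathcal{P}}(x, y)\, \mathrm{d}\pi(x, y),
\]
where $\Pi(\mu, \nu)$ denotes the set of couplings with marginals $\mu$ and $\nu$.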