In this paper, we study \xw{dataset distillation (DD)} from a novel perspective and introduce a \emph{dataset factorization} approach, termed \emph{HaBa}, which is a plug-and-play strategy portable to any existing DD baseline. Unlike conventional DD approaches that aim to produce distilled and representative samples, \emph{HaBa} explores decomposing a dataset into two components: data \emph{Ha}llucination networks and \emph{Ba}ses, where the latter is fed into the former to reconstruct image samples. The flexible combination of bases and hallucination networks therefore equips the distilled data with an exponential gain in informativeness, which largely increases the representation capability of distilled datasets. To further improve the data efficiency of the compression results, we introduce a pair of adversarial contrastive constraints on the resultant hallucination networks and bases, which increase the diversity of generated images and inject more discriminant information into the factorization. Extensive comparisons and experiments demonstrate that our method yields significant improvements on downstream classification tasks over previous state-of-the-art methods, while reducing the total number of compressed parameters by up to 65\%. Moreover, datasets distilled by our approach also achieve \textasciitilde10\% higher accuracy than baseline methods in cross-architecture generalization. Our code is available \href{https://github.com/Huage001/DatasetFactorization}{here}.