Multimodal dataset distillation aims to synthesize a small set of image-text pairs that enables efficient training of large-scale vision-language models. While dataset distillation has shown promise in unimodal tasks, extending it to multimodal contrastive learning presents key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior approaches address scalability by freezing the text encoder and updating only the image encoder and text projection layer. However, we find that this design severely limits semantic alignment and becomes a bottleneck for performance scaling. We propose CovMatch, a scalable dataset distillation framework that aligns the cross-covariance of real and synthetic features while regularizing feature distributions within each modality. Unlike prior approaches, CovMatch enables joint optimization of both encoders, leading to stronger cross-modal alignment and improved performance. Evaluated on Flickr30K and COCO, CovMatch outperforms state-of-the-art multimodal distillation methods and achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs.
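To make the core objective concrete, below is a minimal PyTorch sketch of what a cross-covariance matching loss with intra-modal regularization could look like. The function names, the squared-Frobenius-norm formulation, the feature normalization, and the weight `lam` are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def covariance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """(Cross-)covariance between two feature batches, each of shape (N, d).

    Features are L2-normalized and mean-centered before computing the
    d x d covariance matrix; both choices are assumptions for this sketch.
    """
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    a = a - a.mean(dim=0, keepdim=True)
    b = b - b.mean(dim=0, keepdim=True)
    return a.T @ b / a.shape[0]

def covmatch_loss(real_img, real_txt, syn_img, syn_txt, lam: float = 0.1):
    """Hypothetical CovMatch-style objective.

    Matches the image-text cross-covariance of the synthetic set to that of
    the real set, and regularizes each modality by also matching the
    intra-modal (auto-)covariances. `lam` balances the two terms.
    """
    # Cross-modal term: align real vs. synthetic image-text cross-covariance.
    cross = (covariance(real_img, real_txt)
             - covariance(syn_img, syn_txt)).pow(2).sum()
    # Intra-modal regularizers: keep per-modality feature distributions close.
    intra_img = (covariance(real_img, real_img)
                 - covariance(syn_img, syn_img)).pow(2).sum()
    intra_txt = (covariance(real_txt, real_txt)
                 - covariance(syn_txt, syn_txt)).pow(2).sum()
    return cross + lam * (intra_img + intra_txt)
```

In this sketch the synthetic features would come from encoding the learnable image-text pairs, and the loss would be backpropagated through both encoders to the synthetic data, consistent with the joint optimization of both encoders described above.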