Vision-language pre-training (VLP) has attracted increasing attention recently. With a large amount of image-text pairs, VLP models trained with contrastive loss have achieved impressive performance on various tasks, especially in zero-shot generalization on downstream datasets. In practical applications, however, massive data are usually collected in a streaming fashion, requiring VLP models to continuously integrate novel knowledge from incoming data while retaining learned knowledge. In this work, we focus on learning a VLP model from sequential chunks of image-text pairs. To tackle the catastrophic forgetting issue in this multi-modal continual learning setting, we first introduce pseudo text replay, which generates hard negative texts conditioned on the training images in memory; this not only better preserves learned knowledge but also improves the diversity of negative samples in the contrastive loss. Moreover, we propose multi-modal knowledge distillation between images and texts to align the instance-wise predictions of the old and new models. We incrementally pre-train our model on both the instance- and class-incremental splits of the Conceptual Captions dataset, and evaluate the model on zero-shot image classification and image-text retrieval tasks. Our method consistently outperforms existing baselines by a large margin, demonstrating its superiority. Notably, we achieve an average performance boost of $4.60\%$ on downstream image classification datasets for the class-incremental split.
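To make the two components concrete, below is a minimal PyTorch-style sketch of the losses described above. The function names, tensor shapes, and the specific choices of an InfoNCE objective with appended pseudo-text negatives and a KL-based instance-wise distillation term are illustrative assumptions, not the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_pseudo_negatives(img_emb, txt_emb, pseudo_txt_emb,
                                            temperature=0.07):
    """Image-to-text InfoNCE loss where pseudo texts generated from memory
    images are appended as extra hard negatives.
    Assumes L2-normalized embeddings: img_emb/txt_emb of shape (B, D),
    pseudo_txt_emb of shape (M, D)."""
    logits_real = img_emb @ txt_emb.t() / temperature            # (B, B)
    logits_pseudo = img_emb @ pseudo_txt_emb.t() / temperature   # (B, M)
    logits = torch.cat([logits_real, logits_pseudo], dim=1)      # (B, B + M)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return F.cross_entropy(logits, targets)

def multimodal_distillation_loss(img_emb_new, txt_emb_new,
                                 img_emb_old, txt_emb_old,
                                 temperature=0.07):
    """Aligns the instance-wise image-text prediction distribution of the
    new model with that of the frozen old model via KL divergence."""
    log_p_new = F.log_softmax(img_emb_new @ txt_emb_new.t() / temperature, dim=1)
    with torch.no_grad():
        p_old = F.softmax(img_emb_old @ txt_emb_old.t() / temperature, dim=1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean")
```

In this sketch, the two terms would be summed (possibly with a weighting coefficient) to form the overall continual pre-training objective; the weighting and the exact distillation direction are left as assumptions.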