Multimodal representations and continual learning are two areas closely related to human intelligence. The former considers the learning of shared representation spaces where information from different modalities can be compared and integrated (we focus on cross-modal retrieval between language and visual representations). The latter studies how to prevent forgetting of previously learned tasks when learning new ones. While humans excel at both, deep neural networks remain quite limited. In this paper, we combine both problems into a continual cross-modal retrieval setting, where we study how the catastrophic interference caused by new tasks impacts the embedding spaces and the cross-modal alignment required for effective retrieval. We propose a general framework that decouples the training, indexing, and querying stages. We also identify and study different factors that may lead to forgetting, and propose tools to alleviate them. We find that the indexing stage plays an important role, and that simply avoiding reindexing the database with the updated embedding networks can lead to significant gains. We evaluate our methods on two image-text retrieval datasets, obtaining significant gains over the fine-tuning baseline.
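To make the decoupling of stages concrete, below is a minimal sketch of such a retrieval pipeline. All names here (RetrievalIndex, encode_image, encode_text) are illustrative placeholders, not the paper's actual implementation; the point is only that the index stores database embeddings computed once, so that after fine-tuning on a new task one can choose not to re-index, keeping the stored vectors intact.

```python
import numpy as np

class RetrievalIndex:
    """Stores database embeddings; decoupled from training and querying.

    Embeddings are computed once at indexing time. Skipping re-indexing
    after the embedding networks are updated keeps old vectors intact,
    which (per the abstract) can significantly reduce forgetting in the
    retrieval results.
    """
    def __init__(self):
        self.keys, self.vecs = [], []

    def add(self, key, vec):
        # L2-normalize so the dot product below is cosine similarity.
        self.keys.append(key)
        self.vecs.append(vec / np.linalg.norm(vec))

    def query(self, q, k=5):
        q = q / np.linalg.norm(q)
        sims = np.stack(self.vecs) @ q            # cosine similarities
        top = np.argsort(-sims)[:k]
        return [(self.keys[i], float(sims[i])) for i in top]

# --- usage sketch (placeholder encoders stand in for the trained networks) ---
rng = np.random.default_rng(0)
encode_image = lambda x: rng.standard_normal(128)   # hypothetical image encoder
encode_text  = lambda t: rng.standard_normal(128)   # hypothetical text encoder

index = RetrievalIndex()
for img_id in ["img0", "img1", "img2"]:
    index.add(img_id, encode_image(img_id))         # indexing stage, done once

# ... fine-tune the encoders on a new task here ...
# Note: we deliberately do NOT rebuild the index with the updated image encoder.

print(index.query(encode_text("a dog on the beach")))  # querying stage
```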