Despite the success of multimodal learning in cross-modal retrieval tasks, this remarkable progress relies on correct correspondences among multimedia data. However, collecting such ideal data is expensive and time-consuming. In practice, most widely used datasets are harvested from the Internet and inevitably contain mismatched pairs. Training on such noisy-correspondence data degrades performance, because cross-modal retrieval methods can wrongly enforce mismatched pairs to be similar. To tackle this problem, we propose a Meta Similarity Correction Network (MSCN) that provides reliable similarity scores. We view a binary classification task as the meta-process, which encourages the MSCN to learn discrimination from positive and negative meta-data. To further alleviate the influence of noise, we design an effective data purification strategy that uses the meta-data as prior knowledge to remove noisy samples. Extensive experiments under both synthetic and real-world noise, on Flickr30K, MS-COCO, and Conceptual Captions, demonstrate the strengths of our method.
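The core idea of correcting similarities with a small set of clean meta-data can be illustrated with a deliberately simplified sketch. This is not the paper's actual MSCN architecture or purification strategy: it replaces the correction network with a one-dimensional logistic model trained on clean matched/mismatched pairs (the "meta-data" binary classification), then drops training pairs whose corrected score falls below a threshold. All function names, distributions, and the 0.5 threshold are illustrative assumptions.

```python
# Toy sketch (NOT the paper's MSCN): a tiny logistic "similarity corrector"
# fit on a small clean meta-set, then used to purify a noisy training set.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_corrector(raw_sims, labels, lr=0.5, epochs=500):
    """Fit w, b so that sigmoid(w * s + b) separates matched (label 1)
    from mismatched (label 0) pairs; plain gradient descent on logistic loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        p = sigmoid(w * raw_sims + b)
        grad = p - labels  # dL/dz for the logistic loss
        w -= lr * np.mean(grad * raw_sims)
        b -= lr * np.mean(grad)
    return w, b

# Clean meta-data: raw similarity scores for pairs with KNOWN correspondence
# (hypothetical distributions: matched pairs score higher on average).
meta_sims = np.concatenate([rng.normal(0.7, 0.1, 50),    # positive meta-data
                            rng.normal(0.3, 0.1, 50)])   # negative meta-data
meta_labels = np.concatenate([np.ones(50), np.zeros(50)])
w, b = train_corrector(meta_sims, meta_labels)

# Noisy training set: keep only pairs the corrector deems likely matched,
# i.e. a simple purification step using the meta-data as prior knowledge.
train_sims = np.array([0.75, 0.68, 0.35, 0.28])
corrected = sigmoid(w * train_sims + b)
keep = corrected > 0.5
```

In this sketch the corrector's decision boundary lands between the two meta-data clusters, so the high-similarity pairs survive purification while the low-similarity (likely mismatched) pairs are removed; the real method learns a far richer correction network and meta-process than a 1-D logistic fit.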