Despite recent developments in cross-modal retrieval, little research has focused on low-resource languages due to the lack of manually annotated datasets. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use Machine Translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, as MT is not perfect, it tends to introduce noise during translation, corrupting the textual embeddings and thereby compromising retrieval performance. To alleviate this, we introduce a multi-view self-distillation method for learning noise-robust target-language representations, which employs a cross-attention module to generate soft pseudo-targets that provide direct supervision from both a similarity-based view and a feature-based view. In addition, inspired by back-translation in unsupervised MT, we minimize the semantic discrepancy between original sentences and their back-translated counterparts to further improve the noise robustness of the textual encoder. Extensive experiments on three video-text and image-text cross-modal retrieval benchmarks across different languages show that our method significantly improves overall performance without using extra human-labeled data. Moreover, equipped with a pre-trained visual encoder from a recent vision-and-language pre-training framework, i.e., CLIP, our model achieves a significant performance gain, showing that our method is compatible with popular pre-training models. Code and data are available at https://github.com/HuiGuanLab/nrccr.
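To make the two training signals described above concrete, below is a minimal PyTorch sketch, not the authors' released implementation: the function names, the temperature tau, the choice of KL divergence for the similarity-based distillation view, and the cosine form of the back-translation consistency loss are all illustrative assumptions. It shows (1) a self-distillation loss that pushes a noisy target-language similarity distribution toward soft pseudo-targets from a cleaner source-language view, and (2) a loss minimizing the semantic discrepancy between original and back-translated sentence embeddings.

```python
import torch
import torch.nn.functional as F


def self_distillation_loss(teacher_sim, student_sim, tau=0.05):
    # Soft pseudo-targets come from the teacher view and are detached,
    # so no gradient flows back through the teacher. KL divergence pulls
    # the student's similarity distribution toward these soft targets.
    targets = F.softmax(teacher_sim.detach() / tau, dim=-1)
    log_probs = F.log_softmax(student_sim / tau, dim=-1)
    return F.kl_div(log_probs, targets, reduction="batchmean")


def back_translation_consistency(src_emb, bt_emb):
    # Semantic discrepancy between an original sentence embedding and
    # its back-translated counterpart, as 1 - cosine similarity.
    return (1.0 - F.cosine_similarity(src_emb, bt_emb, dim=-1)).mean()


if __name__ == "__main__":
    B, D = 8, 256  # batch size, embedding dimension (illustrative)
    video = F.normalize(torch.randn(B, D), dim=-1)
    text_src = F.normalize(torch.randn(B, D), dim=-1)  # source-language text
    text_tgt = F.normalize(torch.randn(B, D), dim=-1)  # MT target-language text
    text_bt = F.normalize(torch.randn(B, D), dim=-1)   # back-translated text

    teacher_sim = video @ text_src.t()  # clean source view as teacher
    student_sim = video @ text_tgt.t()  # noisy target view as student

    loss = (self_distillation_loss(teacher_sim, student_sim)
            + back_translation_consistency(text_src, text_bt))
    print(f"illustrative total loss: {loss.item():.4f}")
```

In this sketch the cross-attention module that the paper uses to produce pseudo-targets is abstracted away into the teacher similarity matrix; the key design point it illustrates is that supervision for the noisy target-language branch comes from soft distributions rather than hard labels.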