This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages. We identify key differences in model behavior and performance between English and non-English settings, attributable to the English-only pre-training of CLIP and HuBERT, and investigate how fine-tuning the pre-trained models affects these differences. Finally, we show that our models can be used for monolingual and cross-lingual speech-text retrieval and for cross-lingual speech-speech retrieval, despite never having seen any parallel speech-text or speech-speech data during training.