In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were trained and evaluated only on English videos. To learn multilingual audio-visual representations, we propose a cascaded approach that leverages a model trained on English videos and applies it to audio-visual data in other languages, such as Japanese videos. With our cascaded approach, we show a nearly 10x improvement in retrieval performance compared to training solely on Japanese videos. We also apply the model trained on English videos to Japanese and Hindi spoken captions of images, achieving state-of-the-art performance.
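To make the cascaded idea concrete, below is a minimal sketch of cross-lingual audio-to-video retrieval under the assumption that an encoder trained only on English videos is reused, unchanged, to embed Japanese audio and video into a shared space; the array shapes, dimensionality, and `retrieve` helper are hypothetical illustrations, not the paper's actual API.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (queries) and b (targets)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def retrieve(audio_emb: np.ndarray, video_emb: np.ndarray, k: int = 10) -> np.ndarray:
    """For each audio query embedding, return indices of the top-k videos."""
    sims = cosine_similarity_matrix(audio_emb, video_emb)
    # Sort each row by descending similarity and keep the k best matches.
    return np.argsort(-sims, axis=1)[:, :k]

# Hypothetical usage: random arrays stand in for embeddings that an
# English-trained audio-visual model would produce, applied zero-shot
# to Japanese audio queries and Japanese candidate videos.
rng = np.random.default_rng(0)
japanese_audio = rng.normal(size=(5, 512))    # 5 audio queries
japanese_video = rng.normal(size=(100, 512))  # 100 candidate videos
top_k = retrieve(japanese_audio, japanese_video, k=10)
print(top_k.shape)  # (5, 10)
```

The design point this sketch captures is that no retraining on the target language is required: retrieval reduces to nearest-neighbor search in the embedding space the English-trained model already provides.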