Speech-based image retrieval has been studied as a proxy for joint representation learning, usually without emphasis on retrieval itself. As such, it is unclear how well speech-based retrieval can work in practice -- both in an absolute sense and versus alternative strategies that combine automatic speech recognition (ASR) with strong text encoders. In this work, we extensively study and expand choices of encoder architectures, training methodology (including unimodal and multimodal pretraining), and other factors. Our experiments cover different types of speech in three datasets: Flickr Audio, Places Audio, and Localized Narratives. Our best model configuration achieves large gains over the state of the art, e.g., pushing recall-at-one from 21.8% to 33.2% for Flickr Audio and from 27.6% to 53.4% for Places Audio. We also show that our best speech-based models can match or exceed cascaded ASR-to-text encoding when speech is spontaneous, accented, or otherwise hard to automatically transcribe.
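For readers less familiar with the evaluation metric, the sketch below shows how recall-at-one (and recall-at-k generally) is typically computed in a dual-encoder retrieval setup like the one studied here: each speech query is scored against all candidate images, and the metric is the fraction of queries whose paired image ranks in the top k. The array shapes, function name, and L2-normalized cosine-similarity scoring are illustrative assumptions for this sketch, not details taken from the paper.

```python
# Minimal sketch of recall@k for speech-to-image retrieval (illustrative;
# not the paper's implementation). Assumes row i of `speech_emb` and row i
# of `image_emb` are a ground-truth pair, and both are L2-normalized so a
# dot product gives cosine similarity.
import numpy as np

def recall_at_k(speech_emb: np.ndarray, image_emb: np.ndarray, k: int = 1) -> float:
    """Fraction of speech queries whose paired image ranks in the top k."""
    sims = speech_emb @ image_emb.T                # (num_queries, num_images)
    top_k = np.argsort(-sims, axis=1)[:, :k]       # indices of k best images per query
    targets = np.arange(sims.shape[0])[:, None]    # ground-truth index = row index
    return float((top_k == targets).any(axis=1).mean())

# Toy usage with random unit vectors (real embeddings would come from the
# speech and image encoders).
rng = np.random.default_rng(0)
s = rng.standard_normal((100, 64)); s /= np.linalg.norm(s, axis=1, keepdims=True)
v = rng.standard_normal((100, 64)); v /= np.linalg.norm(v, axis=1, keepdims=True)
print(f"recall@1 = {recall_at_k(s, v, k=1):.3f}")
```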