Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms rely on parallel audio-text data, which is, however, scarce on the web. We propose VIP-ANT, which induces \textbf{A}udio-\textbf{T}ext alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and implicitly connects audio and text in a tri-modal embedding space. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art on Clotho caption retrieval (with audio queries) by 2.2\% R@1. We further investigate cases of minimal audio-text supervision, finding that, e.g., just a few hundred supervised audio-text pairs increase zero-shot audio classification accuracy by 8\% on US8K. However, to reach human parity on some zero-shot tasks, our empirical scaling experiments suggest that we would need about $2^{21} \approx 2M$ supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.
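To make the pivoting idea concrete, the following is a minimal sketch of how zero-shot audio classification can be carried out once audio and text encoders have been aligned to a shared image embedding space. All function names and the prompt template are hypothetical placeholders introduced for illustration (they do not come from the paper); in VIP-ANT the audio encoder would be trained against image embeddings and the text encoder would be a CLIP-style text encoder, so audio and text become directly comparable without any paired audio-text supervision.
\begin{verbatim}
# Minimal sketch: zero-shot audio classification in an image-pivoted
# tri-modal embedding space. Encoders below are random placeholders
# standing in for encoders that map audio and text into the same space.
import numpy as np

EMBED_DIM = 512  # assumed embedding size

def encode_audio(waveform: np.ndarray) -> np.ndarray:
    """Hypothetical audio encoder aligned to the image embedding space."""
    rng = np.random.default_rng(abs(int(waveform.sum())) % (2 ** 32))
    return rng.standard_normal(EMBED_DIM)

def encode_text(caption: str) -> np.ndarray:
    """Hypothetical text encoder aligned to the same image embedding space."""
    rng = np.random.default_rng(hash(caption) % (2 ** 32))
    return rng.standard_normal(EMBED_DIM)

def zero_shot_classify(waveform: np.ndarray, class_names: list) -> str:
    """Return the class whose text embedding is closest to the audio embedding."""
    a = encode_audio(waveform)
    a = a / np.linalg.norm(a)
    scores = []
    for name in class_names:
        # Prompt template as used in CLIP-style zero-shot setups (assumed).
        t = encode_text("a sound of a " + name)
        t = t / np.linalg.norm(t)
        scores.append(float(a @ t))  # cosine similarity
    return class_names[int(np.argmax(scores))]

if __name__ == "__main__":
    fake_audio = np.zeros(16000)  # 1 s of silence at 16 kHz, illustration only
    print(zero_shot_classify(fake_audio, ["dog bark", "siren", "rain"]))
\end{verbatim}
Because both encoders target the image embedding space, no audio-text pairs are needed at training time; the image modality alone carries the cross-modal correspondence that the similarity scores above exploit.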