Audiovisual representation learning typically relies on the correspondence between sight and sound. However, multiple audio tracks can often correspond to the same visual scene. Consider, for example, different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this, we use dubbed versions of movies to augment cross-modal contrastive learning. Our approach learns to represent alternate audio tracks, differing only in speech content, similarly to the same video. Our results show that dub-augmented training improves performance on a range of auditory and audiovisual tasks, without significantly affecting linguistic task performance overall. We additionally compare this approach to a strong baseline where we remove speech before pretraining, and find that dub-augmented training is more effective, including for paralinguistic and audiovisual tasks where speech removal leads to worse performance. These findings highlight the importance of considering speech variation when learning scene-level audiovisual correspondences and suggest that dubbed audio can be a useful augmentation technique for training audiovisual models toward more robust performance.
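The core idea can be sketched as a contrastive objective in which, for each video clip, both the original soundtrack and a dubbed track of the same scene are treated as positives. The following is a minimal illustrative sketch assuming an InfoNCE-style loss over precomputed embeddings; the function name, shapes, and temperature are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def dub_augmented_contrastive_loss(video, audio_orig, audio_dub, temperature=0.07):
    """InfoNCE-style loss where each video clip has TWO positives:
    its original audio and a dubbed audio track (same scene, different
    speech content). Audio from other clips in the batch serves as
    negatives. All arguments are (N, D) embedding matrices."""
    v = l2_normalize(video)                                    # (N, D)
    a = l2_normalize(np.concatenate([audio_orig, audio_dub]))  # (2N, D)
    logits = v @ a.T / temperature                             # (N, 2N)
    # Numerically stable log-softmax over each row.
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    n = v.shape[0]
    idx = np.arange(n)
    # Positives for clip i sit at columns i (original) and i + n (dub);
    # average their negative log-likelihoods.
    loss = -(log_probs[idx, idx] + log_probs[idx, idx + n]) / 2
    return loss.mean()
```

In this framing, the dub acts as a free augmentation: the model is pushed to map audio tracks that differ only in speech content to nearby points, which is one way to encourage scene-level rather than speech-level audiovisual correspondence.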