We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily supervised baselines.
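To make the cross-modal attention concrete, below is a minimal sketch, not the paper's implementation, in which audio tokens act as queries over visual features extracted from the target-environment image; the module name, feature dimensions, and residual connection are illustrative assumptions.

```python
# Minimal sketch of audio-visual cross-attention (illustrative, not the authors' code).
# Assumption: audio tokens query visual tokens so that visual acoustic cues
# (geometry/materials) are injected into the audio representation.
import torch
import torch.nn as nn

class AudioVisualAttention(nn.Module):
    def __init__(self, audio_dim=512, visual_dim=512, num_heads=8):
        super().__init__()
        # Project image features into the audio embedding space before attention.
        self.proj_visual = nn.Linear(visual_dim, audio_dim)
        self.attn = nn.MultiheadAttention(embed_dim=audio_dim,
                                          num_heads=num_heads,
                                          batch_first=True)

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens:  (batch, T_audio, audio_dim)  from a source-audio encoder
        # visual_tokens: (batch, N_patches, visual_dim) from an image encoder
        kv = self.proj_visual(visual_tokens)
        out, _ = self.attn(query=audio_tokens, key=kv, value=kv)
        # Residual connection: audio stream enriched with visual room properties.
        return audio_tokens + out

# Example usage with dummy tensors.
audio = torch.randn(2, 200, 512)   # e.g., 200 audio frames
visual = torch.randn(2, 49, 512)   # e.g., a 7x7 grid of image patches
fused = AudioVisualAttention()(audio, visual)
print(fused.shape)  # torch.Size([2, 200, 512])
```

The fused audio tokens would then be decoded back to a waveform that reflects the target room acoustics; the decoder and training objective are omitted here.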