Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a human speaker reveals important cues about the room geometry, materials, and speaker location, all of which influence the precise reverberation effects in the audio stream. We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene. In support of this new task, we develop a large-scale dataset that uses realistic acoustic renderings of speech in real-world 3D scans of homes offering a variety of room acoustics. Demonstrating our approach on both simulated and real imagery for speech enhancement, speech recognition, and speaker identification, we show it achieves state-of-the-art performance and substantially improves over traditional audio-only methods. Project page: http://vision.cs.utexas.edu/projects/learning-audio-visual-dereverberation.
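To make the high-level description concrete, below is a minimal sketch of how an audio-visual dereverberation model could be wired up in PyTorch: a CNN encodes an RGB view of the room, a CNN encodes the reverberant magnitude spectrogram, the two embeddings are fused, and a decoder predicts a multiplicative mask applied to the input spectrogram. This is an illustrative assumption of one common mask-based formulation, not the authors' actual VIDA architecture; all module names and sizes are hypothetical.

```python
# Hedged sketch of an audio-visual dereverberation network (assumed design,
# not the published VIDA model). Predicts a (0, 1) time-frequency mask over
# the reverberant spectrogram, conditioned on a visual embedding of the room.
import torch
import torch.nn as nn

class AudioVisualDereverb(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Visual branch: encodes the scene image into one embedding vector.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Audio branch: encodes the reverberant magnitude spectrogram.
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Decoder: fuses both streams and predicts the dereverberation mask.
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * embed_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, spec: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # spec:  (B, 1, F, T) reverberant magnitude spectrogram
        # image: (B, 3, H, W) RGB view of the environment
        a = self.audio_encoder(spec)                   # (B, D, F, T)
        v = self.visual_encoder(image)                 # (B, D)
        # Broadcast the visual embedding over every time-frequency bin.
        v = v[:, :, None, None].expand(-1, -1, a.shape[2], a.shape[3])
        mask = self.decoder(torch.cat([a, v], dim=1))  # (B, 1, F, T)
        return mask * spec  # estimated clean magnitude spectrogram

# Usage: train with, e.g., an L1 loss against the anechoic spectrogram.
model = AudioVisualDereverb()
spec = torch.rand(2, 1, 257, 100)   # toy batch of spectrograms
image = torch.rand(2, 3, 180, 320)  # toy batch of scene images
print(model(spec, image).shape)     # torch.Size([2, 1, 257, 100])
```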