We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments. Given a 3D mesh of a real-world environment, SoundSpaces can generate highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations. Together with existing 3D visual assets, it supports an array of audio-visual research tasks, such as audio-visual navigation, mapping, source localization and separation, and acoustic matching. Compared to existing resources, SoundSpaces 2.0 has the advantages of allowing continuous spatial sampling, generalization to novel environments, and configurable microphone and material properties. To the best of our knowledge, this is the first geometry-based acoustic simulation that offers high fidelity and realism while also being fast enough to use for embodied learning. We showcase the simulator's properties and benchmark its performance against real-world audio measurements. In addition, we demonstrate two downstream tasks, embodied navigation and far-field automatic speech recognition, highlighting sim2real performance for the latter. SoundSpaces 2.0 is publicly available to facilitate wider research for perceptual systems that can both see and hear.
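To make the configurability claims concrete, below is a minimal sketch of rendering a binaural room impulse response with habitat-sim's audio sensor, the backend SoundSpaces 2.0 builds on. The class and method names (AudioSensorSpec, RLRAudioPropagationConfiguration, setAudioSourceTransform) and the placeholder scene path follow our reading of the habitat-sim audio tutorial and should be treated as assumptions that may differ across versions, not a definitive API reference.

```python
# Illustrative sketch (API names assumed from the habitat-sim audio
# tutorial; verify against the official docs for your version).
import numpy as np
import habitat_sim

# Standard habitat-sim setup for a 3D mesh of a real-world environment.
backend_cfg = habitat_sim.SimulatorConfiguration()
backend_cfg.scene_id = "path/to/scene.glb"  # placeholder scene path
agent_cfg = habitat_sim.agent.AgentConfiguration()
sim = habitat_sim.Simulator(habitat_sim.Configuration(backend_cfg, [agent_cfg]))

# Acoustic configuration: enable material-dependent absorption/scattering.
acoustics_cfg = habitat_sim.sensor.RLRAudioPropagationConfiguration()
acoustics_cfg.enableMaterials = True

# Microphone configuration: a 2-channel binaural receiver.
channel_layout = habitat_sim.sensor.RLRAudioPropagationChannelLayout()
channel_layout.channelType = (
    habitat_sim.sensor.RLRAudioPropagationChannelLayoutType.Binaural
)
channel_layout.channelCount = 2

# Attach the audio sensor to the agent.
audio_sensor_spec = habitat_sim.AudioSensorSpec()
audio_sensor_spec.uuid = "audio_sensor"
audio_sensor_spec.acousticsConfig = acoustics_cfg
audio_sensor_spec.channelLayout = channel_layout
sim.add_sensor(audio_sensor_spec)

# Continuous spatial sampling: the source is placed at an arbitrary
# 3D location rather than on a precomputed grid.
audio_sensor = sim.get_agent(0)._sensors["audio_sensor"]
audio_sensor.setAudioSourceTransform(np.array([1.0, 1.5, -2.0]))

# Each observation is an impulse response (channels x samples) rendered
# on the fly for the agent's current pose.
ir = np.asarray(sim.get_sensor_observations()["audio_sensor"])
print(ir.shape)
```

Convolving the returned impulse response with a dry source waveform yields the sound as heard at the microphone, which is how arbitrary sounds can be spatialized at render time.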