The acoustic characteristics of a space are often measured by capturing its impulse response (IR), a representation of how a full-range stimulus sound excites it. Recording these IRs is both time-intensive and expensive, and often infeasible for inaccessible locations. This work instead generates an IR from a single image; the IR can then be applied to other signals using convolution, simulating the reverberant characteristics of the space shown in the image. We use an end-to-end neural network architecture to generate plausible audio impulse responses from single images of acoustic environments. We evaluate our method both by comparison to ground-truth data and by human expert evaluation. We demonstrate our approach by generating plausible impulse responses from diverse settings and formats, including well-known places, musical halls, rooms in paintings, images from animations and computer games, synthetic environments generated from text, panoramic images, and video conference backgrounds.
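To make the convolution step concrete, the following is a minimal sketch of applying an impulse response to a dry signal. It is not the authors' pipeline: the function name `apply_impulse_response`, the 8 kHz sample rate, and the toy IR (exponentially decaying noise standing in for a generated IR) are all assumptions chosen for illustration.

```python
import numpy as np

def apply_impulse_response(dry, ir):
    """Convolve a dry mono signal with an impulse response (hypothetical helper).

    Both inputs are 1-D arrays at the same sample rate. The output is
    peak-normalized so that its largest absolute sample is 1.
    """
    dry = np.asarray(dry, dtype=np.float64)
    ir = np.asarray(ir, dtype=np.float64)
    # Full linear convolution: output length is len(dry) + len(ir) - 1.
    wet = np.convolve(dry, ir, mode="full")
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

# Assumed values for illustration only.
sample_rate = 8000
rng = np.random.default_rng(0)

# Toy stand-in for a generated IR: 0.5 s of noise with an exponential decay.
t = np.arange(int(0.5 * sample_rate)) / sample_rate
toy_ir = rng.standard_normal(t.size) * np.exp(-t / 0.1)

# Dry test signal: a single unit click followed by silence.
dry = np.zeros(sample_rate)
dry[0] = 1.0

wet = apply_impulse_response(dry, toy_ir)
```

Because the dry signal here is a single click, the start of the output is simply the toy IR itself (up to the normalization), which makes it a quick sanity check. For long signals or long IRs, an FFT-based convolution is usually much faster than the direct form used here.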