Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed in the space. To that end, we introduce a transformer-based method that uses self-attention to build a rich acoustic context, then predicts RIRs for arbitrary query source-receiver locations through cross-attention. Additionally, we design a novel training objective that improves the match in acoustic signature between the RIR predictions and the targets. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and, in a major departure from traditional methods, generalizing to novel environments in a few-shot manner. Project: http://vision.cs.utexas.edu/projects/fs_rir.
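The two attention stages named above can be illustrated with a minimal sketch. This is not the authors' model: it omits learned projections, multiple heads, positional or pose encodings, and the decoder that would turn the attended feature into an RIR. The names `observations`, `query`, and `attend`, along with the embedding size and random inputs, are assumptions chosen only for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention without learned weights.

    Each query receives a convex combination of the value vectors,
    weighted by its similarity to the corresponding keys.
    """
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


random.seed(0)
dim = 4

# Hypothetical stand-ins: one embedding per sparse observation (image + echo).
observations = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
# Hypothetical embedding of a single query source-receiver pair.
query = [[random.gauss(0, 1) for _ in range(dim)]]

# Stage 1: self-attention lets every observation attend to all the others.
context = attend(observations, observations, observations)
# Stage 2: cross-attention lets the query read from that shared context.
rir_feature = attend(query, context, context)
```

In this sketch the number of observations can change without touching the query path, which is one reason an attention-based design suits a sparse, variable-size set of inputs.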