Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed in the space. Towards that goal, we introduce a transformer-based method that uses self-attention to build a rich acoustic context, then predicts RIRs of arbitrary query source-receiver locations through cross-attention. Additionally, we design a novel training objective that improves the match in the acoustic signature between the RIR predictions and the targets. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and -- in a major departure from traditional methods -- generalizing to novel environments in a few-shot manner. Project: http://vision.cs.utexas.edu/projects/fs_rir.
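As a rough illustration of the two attention stages described above, the following minimal NumPy sketch runs self-attention over a handful of observation embeddings and then lets query embeddings cross-attend to the result. It is not the authors' implementation: the single-head, projection-free attention, the random embeddings, the residual connection, the linear output head `W_out`, and all dimensions are assumptions chosen only to keep the example small.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Single-head scaled dot-product attention; returns (output, weights)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
# All sizes below are arbitrary, hypothetical choices.
d_model, n_obs, n_queries, rir_dim = 16, 5, 3, 32

# Hypothetical embeddings of the sparse observations (one image + echo per view).
context = rng.normal(size=(n_obs, d_model))
# Hypothetical embeddings of the query source-receiver locations.
queries = rng.normal(size=(n_queries, d_model))

# Self-attention: every observation attends to all observations,
# producing a shared acoustic context (residual connection is an assumption).
context_update, _ = attention(context, context, context)
context_enc = context + context_update

# Cross-attention: every query attends to the encoded context.
query_out, cross_w = attention(queries, context_enc, context_enc)

# Hypothetical linear head mapping each query feature to an RIR representation.
W_out = rng.normal(size=(d_model, rir_dim))
rir_pred = query_out @ W_out
print(rir_pred.shape)  # (3, 32)
```

Because the queries only read from the encoded context, any number of source-receiver locations can be evaluated against the same small set of observations, which is the property the abstract relies on for predicting arbitrary RIRs.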