Binaural audio gives the listener the feeling of being present at the recording location and enhances immersion when coupled with AR/VR. However, recording binaural audio requires a specialized setup that cannot be built into handheld devices, unlike traditional mono audio, which can be recorded with a single microphone. To overcome this drawback, prior works have tried to lift mono audio to binaural audio as a post-processing step conditioned on the visual input. However, all prior approaches missed another crucial piece of information required for the task: the distance of the different sound-producing objects from the recording setup. In this work, we argue that the depth map of the scene can act as a proxy for the distance of objects in the scene, and we show that adding depth features alongside image features improves performance both qualitatively and quantitatively. We propose a novel encoder-decoder architecture in which a hierarchical attention mechanism fuses the image and depth features, each extracted by its own transformer backbone, with the audio features at each layer of the decoder.
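To make the fusion idea concrete, the following is a minimal NumPy sketch of one plausible reading of "hierarchical attention" at a single decoder layer. It is an illustration under stated assumptions, not the architecture proposed in this work: the two-level scheme (audio attends to each modality, then attends over the two resulting contexts), the function names, the token counts, and the feature width are all hypothetical, and the learned query/key/value projections that a real model would use are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Scaled dot-product attention without learned projections.

    queries:     (n_q, d) array, here the audio features.
    keys_values: (n_kv, d) array, here image or depth tokens.
    Returns an (n_q, d) array of attended context vectors.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv)
    return softmax(scores, axis=-1) @ keys_values   # (n_q, d)


def hierarchical_fusion(audio, image_tokens, depth_tokens):
    """Hypothetical two-level fusion for one decoder layer.

    Level 1: audio attends to image tokens and depth tokens separately.
    Level 2: each audio position attends over the two modality contexts.
    """
    img_ctx = cross_attention(audio, image_tokens)  # (n_a, d)
    dep_ctx = cross_attention(audio, depth_tokens)  # (n_a, d)

    ctx = np.stack([img_ctx, dep_ctx], axis=1)      # (n_a, 2, d)
    d = audio.shape[-1]
    scores = np.einsum("nd,nmd->nm", audio, ctx) / np.sqrt(d)
    weights = softmax(scores, axis=-1)              # (n_a, 2), rows sum to 1
    fused = np.einsum("nm,nmd->nd", weights, ctx)   # (n_a, d)

    # Residual connection: visual context is added to the audio features.
    return audio + fused, weights


# Toy inputs with assumed sizes: 16 audio positions, 49 tokens per modality.
rng = np.random.default_rng(0)
audio = rng.standard_normal((16, 32))
image_tokens = rng.standard_normal((49, 32))
depth_tokens = rng.standard_normal((49, 32))

fused, weights = hierarchical_fusion(audio, image_tokens, depth_tokens)
print(fused.shape, weights.shape)
```

In this sketch the second-level weights indicate, for each audio position, how much the image context and the depth context contribute. Applying the same step at every decoder layer would correspond to fusing visual and audio features "at each layer of the decoder".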