We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent hears a mixed stream of multiple audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it needs to extract the target sound accurately at every step using egocentric audio-visual observations. We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone to recover the dynamic target audio, using self-attention to make high-quality estimates for current timesteps and also simultaneously improve its past estimates. Using highly realistic acoustic SoundSpaces simulations in real-world scanned Matterport3D environments, we show that our model is able to learn efficient behavior to carry out continuous separation of a dynamic audio target. Project: https://vision.cs.utexas.edu/projects/active-av-dynamic-separation/.
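To make the role of the transformer memory concrete, here is a minimal sketch in PyTorch, assuming hypothetical module and parameter names (`TransformerSeparatorMemory`, `embed_dim`, `spec_bins`). It illustrates only the general idea of self-attending over all per-step audio-visual embeddings, so that the current separation estimate is produced while past estimates are simultaneously refined; it is not the authors' implementation:

```python
# Illustrative sketch only: names, shapes, and hyperparameters are
# assumptions, not taken from the paper's released code.
import torch
import torch.nn as nn

class TransformerSeparatorMemory(nn.Module):
    """Self-attention over the agent's per-step audio-visual embeddings.

    Re-emits separation estimates for all timesteps seen so far, so
    earlier estimates are refined as new egocentric observations arrive.
    """

    def __init__(self, embed_dim=256, num_heads=8, num_layers=4, spec_bins=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        # No causal mask: every step may attend to every other step,
        # which is what lets new observations improve past estimates.
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Maps each attended embedding to a spectrogram-frame estimate.
        self.head = nn.Linear(embed_dim, spec_bins)

    def forward(self, av_embeddings):
        # av_embeddings: (batch, steps_so_far, embed_dim), one fused
        # audio-visual feature per episode timestep.
        attended = self.encoder(av_embeddings)
        # (batch, steps_so_far, spec_bins): the last row is the current
        # estimate; earlier rows are the simultaneously updated past ones.
        return self.head(attended)

# Usage sketch: append the newest fused embedding each step, rerun the
# memory, and read off current plus refined past target-audio estimates.
memory = TransformerSeparatorMemory()
episode = torch.randn(1, 10, 256)  # 10 steps observed so far
estimates = memory(episode)        # -> (1, 10, 512)
```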