We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent hears a mixed stream of multiple time-varying audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it needs to extract the target sound using egocentric audio-visual observations. We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone to recover the dynamic target audio, improving its own estimates for past timesteps via self-attention. Using highly realistic acoustic SoundSpaces simulations in real-world scanned Matterport3D environments, we show that our model is able to learn efficient behavior to carry out continuous separation of a time-varying audio target. Project: https://vision.cs.utexas.edu/projects/active-av-dynamic-separation/.
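To make the transformer-memory idea concrete, below is a minimal sketch of the kind of module the abstract describes: self-attention over the agent's per-step audio-visual features lets the model revise its separation estimates for all past timesteps, not just the current one. All module names, shapes, and hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TransformerSeparatorMemory(nn.Module):
    """Sketch: jointly re-estimates the target spectrogram for every
    timestep seen so far by self-attending over the episode memory.
    (Hypothetical module; dimensions and heads are placeholder choices.)"""

    def __init__(self, embed_dim=512, num_heads=8, num_layers=4, spec_bins=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Project each refined per-step feature to a ratio mask over the
        # corresponding mixture spectrogram frame.
        self.mask_head = nn.Sequential(nn.Linear(embed_dim, spec_bins),
                                       nn.Sigmoid())

    def forward(self, step_embeddings, mixture_spec):
        # step_embeddings: (B, T, D) fused audio-visual features, one per step
        # mixture_spec:    (B, T, F) magnitude spectrogram of the heard mixture
        refined = self.encoder(step_embeddings)  # self-attention across all past steps
        masks = self.mask_head(refined)          # (B, T, F) masks in [0, 1]
        # Past estimates improve as new observations arrive, since every
        # step's mask is recomputed from the full sequence each forward pass.
        return masks * mixture_spec

# Usage sketch: at each environment step, append the new embedding and
# re-run the memory to refresh estimates for the whole trajectory so far.
memory = TransformerSeparatorMemory()
feats = torch.randn(1, 10, 512)   # 10 steps of fused features
mix = torch.rand(1, 10, 512)      # mixture spectrogram frames
target_estimates = memory(feats, mix)  # (1, 10, 512)
```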