We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest in its environment. The agent hears multiple audio sources simultaneously (e.g., a person speaking down the hall in a noisy household) and must use its eyes and ears to automatically separate out the sounds originating from the target object within a limited time budget. Towards this goal, we introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time, guided by the improvement in predicted audio separation quality. We demonstrate our approach in scenarios motivated by both augmented reality (system is already co-located with the target object) and mobile robotics (agent begins arbitrarily far from the target object). Using state-of-the-art realistic audio-visual simulations in 3D environments, we demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation. Project: http://vision.cs.utexas.edu/projects/move2hear.
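As an illustration of the reward signal described above (movement policies guided by the improvement in predicted audio separation quality), the following is a minimal sketch, not the paper's implementation. It assumes the agent's separator outputs a target magnitude spectrogram at each step and that a ground-truth target spectrogram is available during training in simulation; the function names and the L1-based quality measure are hypothetical.

```python
import numpy as np


def separation_quality(pred_spec: np.ndarray, gt_spec: np.ndarray) -> float:
    """Hypothetical quality score for a predicted target spectrogram:
    negative mean L1 error against the ground truth (higher is better).
    The paper's actual quality measure may differ."""
    return -float(np.mean(np.abs(pred_spec - gt_spec)))


def step_reward(prev_quality: float, curr_quality: float) -> float:
    """Reward the agent for the *improvement* in predicted separation
    quality between consecutive steps, as the abstract describes."""
    return curr_quality - prev_quality


# Example usage (toy spectrograms): the agent earns positive reward only
# when moving to a new viewpoint improves the predicted separation.
gt = np.random.rand(257, 64)
pred_before = gt + 0.3 * np.random.rand(257, 64)
pred_after = gt + 0.1 * np.random.rand(257, 64)
q0 = separation_quality(pred_before, gt)
q1 = separation_quality(pred_after, gt)
print(step_reward(q0, q1))  # positive: the new placement helped separation
```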