Augmented reality devices have the potential to enhance human perception and enable other assistive functionalities in complex conversational environments. Effectively capturing the audio-visual context necessary for understanding these social interactions first requires detecting and localizing the voice activities of the device wearer and the surrounding people. These tasks are challenging due to their egocentric nature: the wearer's head motion may cause motion blur, surrounding people may appear at difficult viewing angles, and there may be occlusions, visual clutter, audio noise, and poor lighting. Under these conditions, previous state-of-the-art active speaker detection methods do not give satisfactory results. Instead, we tackle the problem in a new setting, using both video and multi-channel microphone-array audio. We propose a novel end-to-end deep learning approach that gives robust voice activity detection and localization results. In contrast to previous methods, our method localizes active speakers from all possible directions on the sphere, even outside the camera's field of view, while simultaneously detecting the device wearer's own voice activity. Our experiments show that the proposed method gives superior results, can run in real time, and is robust against noise and clutter.
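To make the described input/output structure concrete, below is a minimal sketch of a model with the interface the abstract implies: multi-channel microphone-array audio plus video in, a discretized spherical activity map (covering directions outside the camera's field of view) plus the wearer's own voice activity out. All module choices, names, and dimensions here are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class EgoAVLocalizer(nn.Module):
    """Hypothetical sketch of an egocentric audio-visual voice-activity
    localizer. Only the inputs and the two outputs follow the abstract;
    every layer choice is an assumption."""

    def __init__(self, num_mics: int = 4, sphere_h: int = 30, sphere_w: int = 60):
        super().__init__()
        # Audio branch: conv stack over per-channel spectrograms. The
        # microphone channels are kept separate at the input so that
        # inter-channel phase/level cues (needed to estimate direction)
        # are available to the network.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(num_mics, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Video branch: conv encoder over an RGB frame.
        self.video_enc = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        fused = 64 + 64
        # Head 1: per-direction activity probability on a discretized
        # sphere, so speakers can be localized even off-camera.
        self.sphere_head = nn.Linear(fused, sphere_h * sphere_w)
        # Head 2: the device wearer's own voice activity.
        self.wearer_head = nn.Linear(fused, 1)
        self.sphere_h, self.sphere_w = sphere_h, sphere_w

    def forward(self, audio_spec: torch.Tensor, frame: torch.Tensor):
        # audio_spec: (B, num_mics, freq, time); frame: (B, 3, H, W)
        a = self.audio_enc(audio_spec).flatten(1)
        v = self.video_enc(frame).flatten(1)
        f = torch.cat([a, v], dim=1)
        sphere_map = torch.sigmoid(self.sphere_head(f)).view(
            -1, self.sphere_h, self.sphere_w)
        wearer_vad = torch.sigmoid(self.wearer_head(f)).squeeze(1)
        return sphere_map, wearer_vad

# Example usage with dummy tensors:
model = EgoAVLocalizer()
sphere_map, wearer_vad = model(torch.randn(2, 4, 64, 100), torch.randn(2, 3, 224, 224))
```

The two sigmoid heads reflect that spherical localization and wearer voice-activity detection are posed as simultaneous predictions from shared audio-visual features; how the real model fuses modalities or parameterizes the sphere is detailed in the paper itself, not here.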