In this paper, we propose a deep learning-based multi-speaker direction-of-arrival (DOA) estimation method that uses audio and visual signals together with a permutation-free loss function. We first collect a dataset for multi-modal sound source localization (SSL) in which both audio and visual signals are recorded in real-life home TV scenarios. We then propose a novel spatial annotation method that produces the ground-truth DOA for each speaker from the video data, via the transformation between the camera coordinate system and the pixel coordinate system under the pinhole camera model. With spatial location information serving as an additional input alongside the acoustic features, multi-speaker DOA estimation can be solved as an active speaker detection classification task. The label permutation problem that arises in multi-speaker tasks is thereby avoided, since the location of each speaker is used as an input. Experiments conducted on both simulated and real data show that the proposed audio-visual DOA estimation model outperforms the audio-only DOA estimation model by a large margin.
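To make the spatial annotation step concrete, the following is a minimal sketch of how a detected speaker's pixel coordinate can be back-projected to a DOA under the pinhole camera model. It is not the paper's implementation: the intrinsic parameters `fx`, `fy`, `cx`, `cy` are hypothetical, and the sketch assumes the microphone array is co-located and axis-aligned with the camera, so the camera-frame ray itself gives the DOA (any further camera-to-array extrinsic transform used in the paper is omitted).

```python
import math

def pixel_to_doa(u, v, fx, fy, cx, cy):
    """Convert a pixel coordinate (u, v) of a detected speaker into a
    (azimuth, elevation) DOA in degrees via the pinhole camera model.

    fx, fy: focal lengths in pixels; cx, cy: principal point.
    Assumption (not from the paper): the microphone array shares the
    camera's origin and orientation, so no extrinsic transform is applied.
    """
    # Pinhole model: u = fx * X/Z + cx, v = fy * Y/Z + cy.
    # Back-project the pixel onto the ray through the camera center,
    # taking the point on the ray at depth Z = 1:
    x = (u - cx) / fx          # right of the optical axis
    y = (v - cy) / fy          # below the optical axis (image y points down)
    z = 1.0                    # along the optical axis

    azimuth = math.degrees(math.atan2(x, z))
    elevation = math.degrees(math.atan2(-y, math.hypot(x, z)))
    return azimuth, elevation

# Example: a face centered at pixel (960, 400) in a 1920x1080 frame,
# with hypothetical intrinsics fx = fy = 1000, cx = 960, cy = 540:
print(pixel_to_doa(960, 400, fx=1000, fy=1000, cx=960, cy=540))
# -> (0.0, ~7.97): straight ahead in azimuth, slightly above the axis.
```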