Audio-visual speaker diarization aims to detect "who spoke when" using both auditory and visual signals. Existing audio-visual diarization datasets focus mainly on indoor environments such as meeting rooms or news studios, which differ substantially from in-the-wild videos in many scenarios such as movies, documentaries, and audience sitcoms. To develop diarization methods for these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD) dataset. Our experiments demonstrate that adding AVA-AVD to the training set produces significantly better diarization models for in-the-wild videos, even though the dataset is relatively small. Moreover, this benchmark is challenging due to its diverse scenes, complicated acoustic conditions, and completely off-screen speakers. As a first step towards addressing these challenges, we design the Audio-Visual Relation Network (AVR-Net), which introduces a simple yet effective modality mask to capture discriminative information based on face visibility. Experiments show that our method not only outperforms state-of-the-art methods but is also more robust as the ratio of off-screen speakers varies. Our data and code have been made publicly available at https://github.com/showlab/AVA-AVD.
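To make the modality-mask idea concrete, below is a minimal, hypothetical sketch of how face visibility could gate the relation scoring between two segments. This is an illustrative assumption, not the authors' AVR-Net implementation; the module name, feature dimensions, and concatenation scheme are all invented for exposition.

```python
# Hypothetical sketch of a visibility-conditioned relation scorer,
# NOT the actual AVR-Net. All names and shapes are assumptions.
import torch
import torch.nn as nn

class RelationSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Two scoring heads: one for pairs where both faces are visible
        # (audio + visual features), one for pairs with an off-screen speaker
        # (audio features only).
        self.av_head = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.a_head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, audio_a, face_a, audio_b, face_b, mask_a, mask_b):
        """Score whether two segments belong to the same speaker.

        audio_*: (B, dim) audio embeddings; face_*: (B, dim) face embeddings
        (zeros when no face is detected); mask_*: (B,) 1.0 if a face is
        visible in that segment, else 0.0.
        """
        both_visible = (mask_a * mask_b).unsqueeze(1)  # (B, 1) modality mask
        av_score = self.av_head(torch.cat([audio_a, face_a, audio_b, face_b], dim=1))
        a_score = self.a_head(torch.cat([audio_a, audio_b], dim=1))
        # The mask routes each pair to the head matching its face visibility.
        return both_visible * av_score + (1.0 - both_visible) * a_score
```

The point of the sketch is only that the mask lets the scorer fall back to audio-only evidence when a speaker is off screen, rather than relying on missing visual features.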