Audio-visual speaker diarization aims at detecting ``who spoke when`` using both auditory and visual signals. Existing audio-visual diarization datasets mainly focus on indoor environments like meeting rooms or news studios, which are quite different from in-the-wild videos in many scenarios such as movies, documentaries, and audience sitcoms. To create a testbed that can effectively compare diarization methods on videos in the wild, we annotate speaker diarization labels on the AVA movie dataset and create a new benchmark called AVA-AVD. This benchmark is challenging due to the diverse scenes, complicated acoustic conditions, and completely off-screen speakers. Yet, how to handle off-screen and on-screen speakers together remains a critical challenge. To overcome it, we propose a novel Audio-Visual Relation Network (AVR-Net) which introduces an effective modality mask to capture discriminative information based on visibility. Experiments show that our method not only outperforms state-of-the-art methods but is also more robust to varying ratios of off-screen speakers. Ablation studies demonstrate the advantages of the proposed AVR-Net, and especially of the modality mask, on diarization. Our data and code will be made publicly available at https://github.com/zcxu-eric/AVA-AVD.
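A minimal sketch of how a visibility-based modality mask could gate visual features during audio-visual fusion, so that off-screen speakers fall back to audio-only cues. All module names, dimensions, and the learned mask embedding are illustrative assumptions, not the paper's actual AVR-Net implementation:

```python
# Hypothetical sketch of visibility-based modality masking (assumed design,
# not the AVR-Net code): zero out the visual embedding for off-screen
# speakers and add a learned "visibility state" embedding so the model can
# distinguish a masked feature from a genuinely near-zero one.
import torch
import torch.nn as nn

class ModalityMaskedFusion(nn.Module):
    def __init__(self, audio_dim=256, visual_dim=256, out_dim=256):
        super().__init__()
        # Two learned embeddings: index 0 = off-screen, 1 = on-screen.
        self.mask_embed = nn.Embedding(2, visual_dim)
        self.fuse = nn.Linear(audio_dim + visual_dim, out_dim)

    def forward(self, audio_feat, visual_feat, visible):
        # audio_feat:  (B, audio_dim)  speaker embedding from audio
        # visual_feat: (B, visual_dim) face embedding (zeros if no face)
        # visible:     (B,) long tensor, 1 if the speaker's face is on screen
        mask = visible.float().unsqueeze(-1)                  # (B, 1)
        v = visual_feat * mask + self.mask_embed(visible)     # gate + tag
        return self.fuse(torch.cat([audio_feat, v], dim=-1))

# Usage: two speakers, the second one entirely off-screen.
fusion = ModalityMaskedFusion()
audio = torch.randn(2, 256)
video = torch.randn(2, 256)
video[1].zero_()                        # no detected face for speaker 2
out = fusion(audio, video, torch.tensor([1, 0]))
print(out.shape)                        # torch.Size([2, 256])
```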