Audio-visual speaker diarization aims to detect ``who spoke when'' using both auditory and visual signals. Existing audio-visual diarization datasets mainly focus on indoor environments such as meeting rooms or news studios, which differ markedly from in-the-wild videos in many scenarios such as movies, documentaries, and audience sitcoms. To create a testbed that can effectively compare diarization methods on videos in the wild, we annotate speaker diarization labels on the AVA movie dataset and create a new benchmark called AVA-AVD. This benchmark is challenging due to its diverse scenes, complicated acoustic conditions, and entirely off-screen speakers. Handling off-screen and on-screen speakers together remains a critical challenge. To overcome it, we propose a novel Audio-Visual Relation Network (AVR-Net), which introduces an effective modality mask to capture discriminative information based on face visibility. Experiments show that our method not only outperforms state-of-the-art methods but is also more robust as the ratio of off-screen speakers varies. Ablation studies demonstrate the advantages of the proposed AVR-Net, and especially of the modality mask, for diarization. Our data and code will be made publicly available.
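To make the modality-mask idea concrete, here is a minimal sketch of one plausible realization: when a speaker's face is off screen, the (unreliable) visual embedding is replaced by a learnable placeholder before audio-visual fusion, so the network can condition on which modalities are actually present. The class name `ModalityMaskFusion`, the `missing_face` token, and all dimensions are hypothetical illustrations under stated assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ModalityMaskFusion(nn.Module):
    """Toy sketch of visibility-conditioned audio-visual fusion.

    For off-screen speakers the face embedding carries no signal, so it
    is swapped for a learnable `missing_face` token (an assumption for
    illustration, not the published AVR-Net design).
    """

    def __init__(self, audio_dim=256, face_dim=256, out_dim=256):
        super().__init__()
        # Learnable stand-in for absent visual evidence.
        self.missing_face = nn.Parameter(torch.zeros(face_dim))
        self.fuse = nn.Sequential(
            nn.Linear(audio_dim + face_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, audio_emb, face_emb, visible):
        # visible: (B,) boolean mask, True when the face is on screen.
        mask = visible.unsqueeze(-1).float()                   # (B, 1)
        face = mask * face_emb + (1 - mask) * self.missing_face
        return self.fuse(torch.cat([audio_emb, face], dim=-1))

# Usage: embed four speech segments, two with visible faces, and produce
# fused features that a downstream relation/clustering stage could score.
fusion = ModalityMaskFusion()
audio = torch.randn(4, 256)                    # audio embeddings
faces = torch.randn(4, 256)                    # face embeddings (noise when off screen)
visible = torch.tensor([True, False, True, False])
fused = fusion(audio, faces, visible)          # (4, 256)
```

The key design point the sketch illustrates is that masking happens at the feature level rather than by dropping segments, so off-screen speech still participates in diarization on equal footing with on-screen speech.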