Active speaker detection in videos addresses the task of associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information for deriving such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. Each approach has its limitations: audio-visual activity models are confused by other frequently occurring vocal activities, such as laughing and chewing, while speaker-identity-based methods are limited to videos with enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker dataset (movies) and the Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches enhances active speaker detection performance.
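To make the late-fusion idea concrete, the following is a minimal sketch, assuming each face crop receives one score from an audio-visual activity model and one from the cross-modal identity association; the function name, the weighted-average fusion rule, and the weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def late_fusion(activity_scores, identity_scores, alpha=0.5):
    """Combine per-face scores from the two independent approaches.

    activity_scores: audio-visual activity scores, one per face crop.
    identity_scores: cross-modal identity-association scores, aligned
                     with activity_scores.
    alpha: fusion weight; 0.5 gives an unweighted average.
    """
    activity_scores = np.asarray(activity_scores, dtype=float)
    identity_scores = np.asarray(identity_scores, dtype=float)
    return alpha * activity_scores + (1.0 - alpha) * identity_scores

# Example: three candidate faces in a frame; the face with the highest
# fused score is predicted as the active speaker.
fused = late_fusion([0.9, 0.2, 0.4], [0.7, 0.1, 0.8], alpha=0.6)
print(int(fused.argmax()))
```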