Active speaker detection (ASD) systems are important modules for analyzing multi-talker conversations: they aim to detect which speakers, if any, are talking in a visual scene at any given time. Existing research on ASD does not agree on the definition of an active speaker. In this work, we clarify the definition by requiring synchronization between the audio and visual speaking activities. This clarification is motivated by our extensive experiments, through which we discover that existing ASD methods fail to model audio-visual synchronization and often classify unsynchronized videos as active speaking. To address this problem, we propose a cross-modal contrastive learning strategy and apply positional encoding in the attention modules of supervised ASD models to leverage the synchronization cue. Experimental results suggest that our model successfully detects unsynchronized speech as not speaking, addressing this limitation of current models.
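The cross-modal contrastive idea named above can be illustrated with a minimal sketch. The abstract does not specify the loss, so the following is an assumption: an InfoNCE-style objective in which synchronized audio-visual embedding pairs are positives and temporally misaligned pairings are negatives, so that the model is pushed to encode the synchronization cue. The function name `info_nce_sync_loss` and all shapes are illustrative, not the authors' implementation.

```python
import numpy as np

def info_nce_sync_loss(audio, visual, temperature=0.07):
    """Hypothetical cross-modal contrastive loss: row i of `audio` and
    row i of `visual` are a synchronized pair (positive); every other
    pairing acts as an unsynchronized negative."""
    # L2-normalize so dot products are cosine similarities
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    logits = a @ v.T / temperature                 # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    # softmax cross-entropy with the diagonal (synced pairs) as targets
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy check: correlated (synced) pairs should score a lower loss than
# the same embeddings with their temporal alignment broken.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 16))
audio = shared + 0.05 * rng.normal(size=(8, 16))
visual = shared + 0.05 * rng.normal(size=(8, 16))
synced_loss = info_nce_sync_loss(audio, visual)
shuffled_loss = info_nce_sync_loss(audio, visual[::-1])  # misaligned
print(synced_loss < shuffled_loss)  # prints True
```

Under this formulation, an unsynchronized clip cannot sit near its visual counterpart in the embedding space, which is exactly the behavior the abstract argues existing supervised ASD models lack.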