Active speaker detection (ASD) seeks to identify who is speaking in a visual scene containing one or more speakers. Successful ASD depends on the accurate interpretation of short-term and long-term audio and visual information, as well as the audio-visual interaction between them. Unlike prior work, where systems make instantaneous decisions from short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. Experiments demonstrate that TalkNet achieves 3.5\% and 2.2\% improvements over the state-of-the-art systems on the AVA-ActiveSpeaker and Columbia ASD datasets, respectively. Code has been made available at: \textcolor{magenta}{\url{https://github.com/TaoRuijie/TalkNet_ASD}}.
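The abstract describes an architecture built from audio-visual cross-attention followed by self-attention over the fused sequence. The following is a minimal PyTorch sketch of that combination, not the released TalkNet implementation: the module name, feature dimension, head count, and concatenation-based fusion are illustrative assumptions; see the repository above for the actual code.

\begin{verbatim}
# Minimal sketch (assumptions: d_model=128, 8 heads, fusion by concatenation).
import torch
import torch.nn as nn

class CrossSelfAttention(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 8):
        super().__init__()
        # Cross-attention: each modality attends to the other for
        # inter-modality interaction.
        self.audio_to_visual = nn.MultiheadAttention(d_model, n_heads,
                                                     batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(d_model, n_heads,
                                                     batch_first=True)
        # Self-attention over the fused sequence to capture long-term
        # speaking evidence.
        self.self_attn = nn.MultiheadAttention(2 * d_model, n_heads,
                                               batch_first=True)
        self.classifier = nn.Linear(2 * d_model, 2)  # speaking / not speaking

    def forward(self, audio, visual):
        # audio, visual: (batch, time, d_model) frame-level embeddings
        # produced by the temporal encoders.
        a_att, _ = self.audio_to_visual(query=audio, key=visual, value=visual)
        v_att, _ = self.visual_to_audio(query=visual, key=audio, value=audio)
        fused = torch.cat([a_att, v_att], dim=-1)       # (batch, time, 2*d_model)
        fused, _ = self.self_attn(fused, fused, fused)  # long-term context
        return self.classifier(fused)                   # per-frame logits

# Usage with dummy 100-frame inputs.
model = CrossSelfAttention()
logits = model(torch.randn(2, 100, 128), torch.randn(2, 100, 128))
print(logits.shape)  # torch.Size([2, 100, 2])
\end{verbatim}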