Active speaker detection (ASD) seeks to identify who is speaking in a visual scene containing one or more speakers. Successful ASD depends on the accurate interpretation of short-term and long-term audio and visual information, as well as the audio-visual interaction between them. Unlike prior work, where systems make instantaneous decisions from short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. Experiments demonstrate that TalkNet achieves 3.5\% and 2.2\% improvements over the state-of-the-art systems on the AVA-ActiveSpeaker and Columbia ASD datasets, respectively. Code has been made available at: \textcolor{magenta}{\url{https://github.com/TaoRuijie/TalkNet_ASD}}.
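The abstract describes an architecture built from audio-visual cross-attention followed by self-attention over the fused sequence. The following is a minimal PyTorch sketch of that combination, not the released TalkNet implementation: the module name, feature dimension, head count, and concatenation-based fusion are illustrative assumptions; see the repository above for the actual code.

\begin{verbatim}
# Minimal sketch (assumptions: d_model=128, 8 heads, fusion by concatenation).
import torch
import torch.nn as nn

class CrossSelfAttention(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 8):
        super().__init__()
        # Cross-attention: each modality attends to the other for
        # inter-modality interaction.
        self.audio_to_visual = nn.MultiheadAttention(d_model, n_heads,
                                                     batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(d_model, n_heads,
                                                     batch_first=True)
        # Self-attention over the fused sequence to capture long-term
        # speaking evidence.
        self.self_attn = nn.MultiheadAttention(2 * d_model, n_heads,
                                               batch_first=True)
        self.classifier = nn.Linear(2 * d_model, 2)  # speaking / not speaking

    def forward(self, audio, visual):
        # audio, visual: (batch, time, d_model) frame-level embeddings
        # produced by the temporal encoders.
        a_att, _ = self.audio_to_visual(query=audio, key=visual, value=visual)
        v_att, _ = self.visual_to_audio(query=visual, key=audio, value=audio)
        fused = torch.cat([a_att, v_att], dim=-1)       # (batch, time, 2*d_model)
        fused, _ = self.self_attn(fused, fused, fused)  # long-term context
        return self.classifier(fused)                   # per-frame logits

# Usage with dummy 100-frame inputs.
model = CrossSelfAttention()
logits = model(torch.randn(2, 100, 128), torch.randn(2, 100, 128))
print(logits.shape)  # torch.Size([2, 100, 2])
\end{verbatim}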