It is now well established from a variety of studies that combining video and audio data yields a significant benefit in detecting active speakers. However, either modality can mislead audiovisual fusion by contributing unreliable or deceptive information. This paper formulates active speaker detection as a multi-objective learning problem that leverages the best of each modality through a novel self-attention, uncertainty-based multimodal fusion scheme. Results show that the proposed multi-objective learning architecture outperforms traditional approaches on both mAP and AUC scores. We further demonstrate that, for active speaker detection, our fusion strategy surpasses other modality fusion methods reported across various disciplines. Finally, we show that the proposed method significantly improves the state of the art on the AVA-ActiveSpeaker dataset.
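The core idea behind uncertainty-based fusion can be illustrated with a minimal sketch: each modality predicts an uncertainty alongside its embedding, and the fusion weights are derived so that the noisier modality contributes less. Note that the function name, the use of log-variance as the uncertainty measure, and the softmax weighting below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def uncertainty_weighted_fusion(audio_feat, video_feat,
                                audio_logvar, video_logvar):
    """Fuse two modality embeddings, down-weighting the more uncertain one.

    Illustrative sketch only: each modality supplies a scalar log-variance
    (its estimated uncertainty); fusion weights are the softmax over the
    negative uncertainties, so a noisy modality contributes less.
    """
    weights = softmax(np.stack([-audio_logvar, -video_logvar]), axis=0)
    return weights[0] * audio_feat + weights[1] * video_feat

# Toy example: video is confident (low log-variance), audio is noisy,
# so the fused embedding ends up closer to the video embedding.
audio = np.ones(4)
video = np.full(4, 2.0)
fused = uncertainty_weighted_fusion(audio, video,
                                    audio_logvar=2.0, video_logvar=0.0)
```

In this toy setting the audio weight is roughly 0.12 and the video weight roughly 0.88, so the fused vector lies much nearer the confident video embedding; a hard gate (picking one modality outright) would instead discard the audio signal entirely.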