Active speaker detection is a challenging task in audio-visual scene understanding, which aims to detect who is speaking in scenarios with one or more speakers. This task has received extensive attention because it is crucial in applications such as speaker diarization, speaker tracking, and automatic video editing. Existing studies try to improve performance by feeding in information from multiple candidates and designing complex models. Although these methods achieve outstanding performance, their high memory and computational demands make them difficult to apply in resource-limited scenarios. Therefore, we construct a lightweight active speaker detection architecture by reducing the number of input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying a gated recurrent unit (GRU) with low computational complexity for cross-modal modeling. Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%), while its resource costs are significantly lower than those of the state-of-the-art method, especially in model parameters (1.0M vs. 22.5M, about 23x) and FLOPs (0.6G vs. 2.6G, about 4x). In addition, our framework also performs well on the Columbia dataset, demonstrating good robustness. The code and model weights are available at https://github.com/Junhua-Liao/Light-ASD.
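To make the design concrete, the sketch below illustrates the general idea of factorizing a full 3D convolution into a spatial 2D convolution followed by a temporal convolution, with a GRU for lightweight temporal modeling of a single candidate. This is a minimal illustrative sketch in PyTorch; the layer names, channel sizes, and input resolution are hypothetical and do not reproduce the paper's exact configuration.

```python
# Illustrative sketch only: factorized (2D spatial + temporal) convolution
# plus a GRU, as a lightweight alternative to full 3D convolution.
# Hyperparameters are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class SplitVisualBlock(nn.Module):
    """Factorized spatial/temporal convolution over a face-crop sequence."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Spatial 2D convolution applied per frame: kernel (1, 3, 3).
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Temporal convolution across frames: kernel (3, 1, 1).
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        return self.relu(self.temporal(self.relu(self.spatial(x))))

class LightweightDetector(nn.Module):
    """Single-candidate visual branch + GRU classifier (sketch)."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.visual = SplitVisualBlock(1, feat_dim)        # grayscale face crops
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))     # keep the time axis
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, 1)                 # speaking / not speaking

    def forward(self, faces: torch.Tensor) -> torch.Tensor:
        # faces: (batch, 1, time, H, W)
        feats = self.pool(self.visual(faces)).flatten(2).transpose(1, 2)  # (B, T, C)
        out, _ = self.gru(feats)
        return self.head(out).squeeze(-1)                  # per-frame logits

# Example: score 25 frames of 112x112 face crops for one candidate.
if __name__ == "__main__":
    model = LightweightDetector()
    scores = model(torch.randn(1, 1, 25, 112, 112))
    print(scores.shape)  # torch.Size([1, 25])
```

Keeping the recurrent unit a GRU (rather than attention over all candidates) and replacing full 3D convolutions with the factorized block is what keeps the parameter count and FLOPs low in this style of design.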