Audio-visual active speaker detection aims to identify the active speaker in videos by leveraging complementary audio and visual cues. Existing methods often suffer from computational inefficiency or suboptimal performance because they jointly model temporal and speaker interactions. We propose D$^{2}$Stream, a decoupled dual-stream framework that separates cross-frame temporal modeling from within-frame speaker discrimination. Audio and visual features are first aligned via cross-modal attention and then fed into two lightweight streams: a Temporal Interaction Stream captures long-range temporal dependencies, while a Speaker Interaction Stream models per-frame inter-person relationships. The temporal and relational features extracted by the two streams interact via cross-attention to enrich their representations. A lightweight Voice Gate module further suppresses false positives caused by non-speech facial movements. On AVA-ActiveSpeaker, D$^{2}$Stream achieves a new state of the art of 95.6% mAP, with an 80% reduction in computation compared to GNN-based models and 30% fewer parameters than attention-based alternatives, while also generalizing well to Columbia ASD. Source code is available at https://anonymous.4open.science/r/D2STREAM.
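A minimal sketch of the decoupled dual-stream idea described above, written with PyTorch-style modules. The class name `DualStreamSketch`, the specific layer choices (multi-head attention and Transformer encoder layers), the feature dimensions, and the sigmoid form of the Voice Gate are illustrative assumptions, not the authors' implementation; only the overall structure (cross-modal alignment, a temporal stream over frames, a speaker stream over per-frame candidates, cross-attention between the two, and an audio-driven gate) follows the description.

```python
# Minimal sketch of the decoupled dual-stream design; layer choices, dimensions,
# and the gating formulation are assumptions for illustration only.
import torch
import torch.nn as nn


class DualStreamSketch(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        # Cross-modal attention: visual features attend to audio features.
        self.av_align = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Temporal Interaction Stream: self-attention across frames, per speaker track.
        self.temporal_stream = nn.TransformerEncoderLayer(
            dim, heads, dim_feedforward=2 * dim, batch_first=True)
        # Speaker Interaction Stream: self-attention across speakers, per frame.
        self.speaker_stream = nn.TransformerEncoderLayer(
            dim, heads, dim_feedforward=2 * dim, batch_first=True)
        # Cross-attention so the temporal and relational features enrich each other.
        self.t2s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.s2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Voice Gate: a scalar gate from audio features (hypothetical form).
        self.voice_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(dim, 1)

    def forward(self, visual, audio):
        # visual: (B, S, T, D) per-speaker face features; audio: (B, T, D).
        B, S, T, D = visual.shape
        v = visual.reshape(B * S, T, D)
        a = audio.repeat_interleave(S, dim=0)            # (B*S, T, D)
        av, _ = self.av_align(v, a, a)                   # audio-visual alignment

        # Temporal stream: attend across the T frames of each speaker track.
        temporal = self.temporal_stream(av)              # (B*S, T, D)

        # Speaker stream: attend across the S candidates within each frame.
        per_frame = av.reshape(B, S, T, D).permute(0, 2, 1, 3).reshape(B * T, S, D)
        speaker = self.speaker_stream(per_frame)
        speaker = speaker.reshape(B, T, S, D).permute(0, 2, 1, 3).reshape(B * S, T, D)

        # Cross-attention between temporal and relational features.
        t_enriched, _ = self.t2s(temporal, speaker, speaker)
        s_enriched, _ = self.s2t(speaker, temporal, temporal)
        fused = t_enriched + s_enriched                  # (B*S, T, D)

        # Voice Gate: down-weight frames whose audio carries no speech evidence.
        gate = self.voice_gate(a)                        # (B*S, T, 1)
        logits = self.classifier(fused * gate)           # (B*S, T, 1)
        return logits.reshape(B, S, T)


# Usage: 2 clips, 3 candidate speakers, 16 frames, 128-dim features.
if __name__ == "__main__":
    model = DualStreamSketch()
    scores = model(torch.randn(2, 3, 16, 128), torch.randn(2, 16, 128))
    print(scores.shape)  # torch.Size([2, 3, 16])
```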