Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly optimized. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context, yielding more suitable feature representations and improved performance on the ASD task. We also introduce interleaved graph neural network (iGNN) blocks, which split message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are better suited for ASD, resulting in state-of-the-art performance. Finally, we design a weakly-supervised strategy, demonstrating that the ASD problem can also be approached using audiovisual data while relying exclusively on audio annotations. We achieve this by modelling the direct relationship between the audio signal and the possible sound sources (speakers), and by introducing a contrastive loss. All resources of this project will be made available at: https://github.com/fuankarion/end-to-end-asd.