We address the problem of active speaker detection with a new framework, called SPELL, that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data. We cast active speaker detection as a node classification task that is aware of longer-term dependencies. We first construct a graph from a video so that each node corresponds to one person. Nodes representing the same identity share edges between them within a defined temporal window. Nodes within the same video frame are also connected to encode inter-person interactions. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning a graph-based representation, owing to its explicit spatial and temporal structure, significantly improves the overall performance. SPELL outperforms several relevant baselines and performs on par with state-of-the-art models while requiring an order of magnitude lower computational cost.
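To make the graph construction described above concrete, the sketch below shows one minimal way such a graph could be assembled from per-frame face detections. The function name `build_graph`, the `face_tracks` input format, and the `window` value are illustrative assumptions, not details taken from the paper; the actual SPELL implementation may differ in edge directionality and node features.

```python
from itertools import combinations

def build_graph(face_tracks, window=0.9):
    """Assemble a SPELL-style graph from per-frame face detections.

    face_tracks: list of (timestamp, person_id, feature) tuples, one per
    detected face crop. `window` is the temporal window (in seconds)
    within which same-identity nodes are connected. All names and the
    window value are hypothetical placeholders for illustration.
    """
    nodes = list(face_tracks)  # one node per detected person instance
    edges = set()

    for i, j in combinations(range(len(nodes)), 2):
        t_i, id_i, _ = nodes[i]
        t_j, id_j, _ = nodes[j]
        # Temporal edges: same identity within the temporal window.
        if id_i == id_j and abs(t_i - t_j) <= window:
            edges.add((i, j))
        # Spatial edges: different people appearing in the same frame.
        if t_i == t_j and id_i != id_j:
            edges.add((i, j))

    return nodes, sorted(edges)
```

Each node would then carry the fused audio-visual features for that person instance, and active speaker detection reduces to binary node classification over this graph.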