The goal of this paper is text-independent speaker verification where utterances come from 'in the wild' videos and may contain irrelevant signal. While speaker verification is naturally a pair-wise problem, existing methods produce speaker embeddings instance-wise. In this paper, we propose Cross Attentive Pooling (CAP), which utilizes the context information across the reference-query pair to generate utterance-level embeddings that contain the most discriminative information for the pair-wise matching problem. Experiments are performed on the VoxCeleb dataset, on which our method outperforms comparable pooling strategies.
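The core idea described above, pooling each utterance's frame-level features with attention weights conditioned on the other utterance in the pair, can be illustrated with a minimal sketch. This is not the paper's exact formulation; the choice of context vector (here, the mean of the other utterance's frames) and the dot-product scoring are simplifying assumptions, and all names are illustrative:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_attentive_pool(ref, qry):
    """Pool each utterance using the other utterance as context.

    ref, qry: (T, D) arrays of frame-level features (T frames, D dims).
    Returns a pair of (D,) utterance-level embeddings.
    Illustrative sketch: the context vector is the mean of the other
    utterance's frames; attention scores are dot products with it.
    """
    ctx_from_qry = qry.mean(axis=0)         # context derived from the query
    ctx_from_ref = ref.mean(axis=0)         # context derived from the reference
    w_ref = softmax(ref @ ctx_from_qry)     # weights over reference frames
    w_qry = softmax(qry @ ctx_from_ref)     # weights over query frames
    return w_ref @ ref, w_qry @ qry         # attention-weighted sums -> (D,)
```

Because the weights depend on the paired utterance, the same recording yields different embeddings for different comparison pairs, which is the key difference from instance-wise pooling such as plain temporal averaging.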