It has already been observed that audio-visual embeddings can be extracted from these two modalities to gain robustness for person verification. However, the aggregator used to generate a single utterance-level representation from per-frame features has not been well explored. In this article, we propose an audio-visual network that considers the aggregator from a fusion perspective. We introduce improved attentive statistics pooling to face verification for the first time. We then find that a strong correlation exists between the modalities during pooling, so we propose joint attentive pooling, which uses cycle consistency to learn implicit inter-frame weights. Finally, the modalities are fused with a gated attention mechanism. All proposed models are trained on the VoxCeleb2 dev dataset, and the best system obtains 0.18\%, 0.27\%, and 0.49\% EER on the three official trial lists of VoxCeleb1, respectively, which are, to our knowledge, the best published results for person verification. As an analysis, visualization maps are generated to explain how this system interacts between modalities.
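The core aggregation step described above, attentive statistics pooling, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes the per-frame attention logits are already computed (in practice they come from a small learned network), and the function name and shapes are our own choices.

```python
import numpy as np

def attentive_stats_pooling(frames, scores):
    """Aggregate per-frame embeddings (T, D) into a single utterance-level
    vector via an attention-weighted mean and standard deviation.

    `scores` (T,) are unnormalized attention logits; here they are given,
    whereas a full model would learn them (hypothetical simplification)."""
    w = np.exp(scores - scores.max())
    w = w / w.sum()                                   # softmax over frames
    mean = (w[:, None] * frames).sum(axis=0)          # weighted first moment
    var = (w[:, None] * frames ** 2).sum(axis=0) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-9, None))           # weighted std, clamped
    return np.concatenate([mean, std])                # (2 * D,) embedding

rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 4))                     # 10 frames, 4-dim each
scores = rng.normal(size=10)
emb = attentive_stats_pooling(frames, scores)
print(emb.shape)  # (8,)
```

Concatenating the weighted mean and standard deviation lets the utterance embedding capture both the average frame content and its attention-weighted variability, which is what distinguishes statistics pooling from plain average pooling.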