It has already been observed that audio-visual embeddings are more robust than uni-modal embeddings for person verification. Here, we propose a novel audio-visual strategy that considers aggregators from a fusion perspective. First, we introduce weight-enhanced attentive statistics pooling to face verification for the first time. Observing a strong correlation between modalities during pooling, we then propose joint attentive pooling, which incorporates cycle consistency to learn implicit inter-frame weights. Finally, the modalities are fused with a gated attention mechanism to obtain robust audio-visual embeddings. All proposed models are trained on the VoxCeleb2 dev dataset, and the best system achieves 0.18%, 0.27%, and 0.49% EER on the three official trial lists of VoxCeleb1 respectively, which are, to our knowledge, the best published results for person verification.
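As a rough illustration of the two building blocks named above, the following is a minimal PyTorch sketch of plain attentive statistics pooling and a sigmoid-gated fusion of two modality embeddings. This is not the authors' exact formulation: the weight enhancement and cycle-consistency terms are omitted, and the class names and dimensions are hypothetical.

```python
# Minimal sketch, assuming PyTorch; names and sizes are illustrative only.
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Pools frame-level features into a fixed utterance-level vector by
    concatenating an attention-weighted mean and standard deviation."""
    def __init__(self, dim: int, bottleneck: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.Tanh(), nn.Linear(bottleneck, 1)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, frames, dim) frame-level features
        alpha = torch.softmax(self.attention(h), dim=1)   # (B, T, 1) weights
        mean = (alpha * h).sum(dim=1)                     # weighted mean (B, D)
        var = (alpha * h.pow(2)).sum(dim=1) - mean.pow(2)
        std = var.clamp(min=1e-8).sqrt()                  # weighted std (B, D)
        return torch.cat([mean, std], dim=-1)             # (B, 2D)

class GatedFusion(nn.Module):
    """Fuses audio and visual embeddings with a learned sigmoid gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # g -> 1 favors the audio embedding, g -> 0 favors the visual one
        g = torch.sigmoid(self.gate(torch.cat([a, v], dim=-1)))
        return g * a + (1.0 - g) * v
```

In this sketch the gate is computed per dimension from both embeddings, so the fusion can lean on whichever modality is more reliable for a given input (e.g. favoring the face embedding when the audio is noisy).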