Double Multi-Head Attention for Speaker Verification

Most state-of-the-art deep learning systems for speaker verification are based on speaker embedding extractors. These architectures are commonly composed of a feature extractor front-end together with a pooling layer that encodes variable-length utterances into fixed-length speaker vectors. In this paper, we present Double Multi-Head Attention pooling, which extends our previous approach based on Self Multi-Head Attention. An additional self-attention layer is added to the pooling layer; it summarizes the context vectors produced by Multi-Head Attention into a single speaker representation. This method enhances the pooling mechanism by weighting the information captured by each head, which yields more discriminative speaker embeddings. We have evaluated our approach on the VoxCeleb2 dataset. Our results show 6.09% and 5.23% relative improvement in terms of equal error rate (EER) compared to Self Attention pooling and Self Multi-Head Attention, respectively. According to these results, Double Multi-Head Attention is an effective way to select the most relevant features captured by CNN-based front-ends from the speech signal.
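The pooling scheme described in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch under assumptions, not the authors' implementation: the function name `double_mha_pooling`, the parameter names `head_queries` and `pooling_query`, the scaled dot-product scoring, and the equal split of the feature dimension across heads are all choices made here for illustration. The first attention pools frames within each head; the second attention weights the resulting per-head context vectors and sums them into one embedding.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def double_mha_pooling(h, head_queries, pooling_query):
    """Pool a variable-length sequence into one fixed-length vector.

    h             : (T, D) frame-level features from the front-end
    head_queries  : (K, D // K) one learnable query per head (assumed form)
    pooling_query : (D // K,) learnable query for the second attention (assumed form)
    """
    T, D = h.shape
    K, d = head_queries.shape
    assert K * d == D, "feature dimension must split evenly across heads"

    # Split every frame into K sub-vectors, one per head.
    heads = h.reshape(T, K, d)

    # First attention: per-head weights over the T frames.
    scores = np.einsum("tkd,kd->tk", heads, head_queries) / np.sqrt(d)
    frame_weights = softmax(scores, axis=0)                    # (T, K)
    contexts = np.einsum("tk,tkd->kd", frame_weights, heads)   # (K, d)

    # Second attention: weights over the K head context vectors.
    head_scores = contexts @ pooling_query / np.sqrt(d)        # (K,)
    head_weights = softmax(head_scores, axis=0)
    embedding = head_weights @ contexts                        # (d,)
    return embedding, frame_weights, head_weights
```

Under these assumptions, utterances with different numbers of frames `T` map to vectors of the same length `D // K`, and the returned `head_weights` show how much each head contributes to the final representation.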


Related content

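Attention mechanism

The attention mechanism was first proposed in the field of computer vision, but it arguably became popular with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which applied attention to an RNN model for image classification. Subsequently, Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate" [1], used an attention-like mechanism to perform translation and alignment jointly in machine translation; their work is regarded as the first to apply attention to NLP. Attention-based extensions of RNN models were then applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the general trend of attention research.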
