Most state-of-the-art Deep Learning systems for speaker verification are based on speaker embedding extractors. These architectures are commonly composed of a feature extractor front-end together with a pooling layer that encodes variable-length utterances into fixed-length speaker vectors. In this paper we present Double Multi-Head Attention pooling, which extends our previous approach based on Self Multi-Head Attention. An additional self-attention layer is added to the pooling mechanism to summarize the context vectors produced by Multi-Head Attention into a unique speaker representation. This method enhances the pooling mechanism by weighting the information captured by each head, yielding more discriminative speaker embeddings. We have evaluated our approach on the VoxCeleb2 dataset. Our results show 6.09% and 5.23% relative improvements in EER over Self Attention pooling and Self Multi-Head Attention, respectively. According to the obtained results, Double Multi-Head Attention proves to be an effective approach for efficiently selecting the most relevant features that CNN-based front-ends capture from the speech signal.
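The abstract describes the mechanism only at a high level. The following PyTorch sketch illustrates one plausible reading of it: a first self-attention pools each head over time into a per-head context vector, and a second self-attention weights those context vectors into a single speaker representation. The class name DoubleMHAPooling, all parameter names, and the scaled-softmax scoring are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DoubleMHAPooling(nn.Module):
    """Minimal sketch of Double Multi-Head Attention pooling.

    Assumptions (not from the paper's code): one learnable scoring
    vector per head for the attention over time, one shared scoring
    vector for the attention over heads, and scaled-softmax scoring.
    """

    def __init__(self, feat_dim: int, num_heads: int):
        super().__init__()
        assert feat_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = feat_dim // num_heads
        # First attention: a scoring vector per head, applied over time.
        self.time_attn = nn.Parameter(torch.randn(num_heads, self.head_dim))
        # Second attention: a scoring vector applied over the per-head
        # context vectors.
        self.head_attn = nn.Parameter(torch.randn(self.head_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) frame-level features from the front-end.
        B, T, _ = x.shape
        h = x.view(B, T, self.num_heads, self.head_dim)
        # Attention weights over time, computed independently per head.
        scores = torch.einsum('bthd,hd->bth', h, self.time_attn)
        w = F.softmax(scores / self.head_dim ** 0.5, dim=1)
        # Per-head context vectors: (batch, num_heads, head_dim).
        ctx = torch.einsum('bth,bthd->bhd', w, h)
        # Second attention: weights over the heads themselves.
        head_scores = torch.einsum('bhd,d->bh', ctx, self.head_attn)
        a = F.softmax(head_scores / self.head_dim ** 0.5, dim=1)
        # Unique fixed-length speaker representation: (batch, head_dim).
        return torch.einsum('bh,bhd->bd', a, ctx)


# Usage: pool 300 frames of 512-dim features from 4 utterances into
# fixed-length vectors, regardless of each utterance's true length.
pool = DoubleMHAPooling(feat_dim=512, num_heads=8)
frames = torch.randn(4, 300, 512)
embedding = pool(frames)  # shape: (4, 64)
```

The second softmax is what the abstract refers to as "giving weights to the information captured for each head": heads whose context vectors are more informative for the speaker receive larger weights in the final representation.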