Initially developed for natural language processing (NLP), the Transformer model is now widely used for speech processing tasks such as speaker recognition, owing to its powerful sequence modeling capability. However, the conventional self-attention mechanism was originally designed for modeling textual sequences, without considering the characteristics of speech and speaker modeling. Moreover, different Transformer variants for speaker recognition have not been well studied. In this work, we propose a novel multi-view self-attention mechanism and present an empirical study of different Transformer variants, with and without the proposed attention mechanism, for speaker recognition. Specifically, to balance the ability to capture global dependencies with the ability to model locality, we propose a multi-view self-attention mechanism for the speaker Transformer, in which different attention heads attend to different ranges of the receptive field. Furthermore, we introduce and compare five Transformer variants with different network architectures, embedding locations, and pooling methods for learning speaker embeddings. Experimental results on the VoxCeleb1 and VoxCeleb2 datasets show that the proposed multi-view self-attention mechanism improves speaker recognition performance, and that the proposed speaker Transformer network attains excellent results compared with state-of-the-art models.
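The core idea of heads attending to different ranges of the receptive field can be sketched as banded attention masks of per-head radius. The following is a minimal illustrative sketch, not the paper's actual implementation: the function names, the per-head `windows` parameter, and the use of `None` for a global (unrestricted) head are assumptions made for this example.

```python
import numpy as np

def multi_view_attention_mask(seq_len, window):
    # Hypothetical helper: band mask limiting one head's receptive field.
    # window is the attention radius; None denotes a global (unmasked) view.
    if window is None:
        return np.zeros((seq_len, seq_len), dtype=bool)
    idx = np.arange(seq_len)
    # True marks positions this head is NOT allowed to attend to.
    return np.abs(idx[None, :] - idx[:, None]) > window

def multi_view_self_attention(q, k, v, windows):
    # q, k, v: (num_heads, seq_len, d_head); windows: one radius per head,
    # so some heads model locality while others capture global dependencies.
    h, t, d = q.shape
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    for i, w in enumerate(windows):
        scores[i][multi_view_attention_mask(t, w)] = -np.inf
    # Numerically stable softmax over the unmasked positions of each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

For instance, `windows=[1, 4, None, None]` would give a four-head layer two local views of different widths and two global views; a head with `window=0` attends only to its own frame.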