This paper explores three novel approaches to improve the performance of speaker verification (SV) systems based on deep neural networks (DNN) using Multi-head Self-Attention (MSA) mechanisms and memory layers. Firstly, we propose the use of a learnable vector called class token to replace the global average pooling mechanism to extract the embeddings. Unlike global average pooling, our proposal takes into account the temporal structure of the input, which is relevant for the text-dependent SV task. The class token is concatenated to the input before the first MSA layer, and its state at the output is used to predict the classes. To gain additional robustness, we introduce two approaches. First, we have developed a Bayesian estimation of the class token. Second, we have added a distilled representation token for training a teacher-student pair of networks using the Knowledge Distillation (KD) philosophy, which is combined with the class token. This distillation token is trained to mimic the predictions from the teacher network, while the class token is trained to replicate the true label. All the strategies have been tested on the RSR2015-Part II and DeepMine-Part 1 databases for text-dependent SV, providing competitive results compared to the same architecture using global average pooling to extract the embeddings.
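A minimal PyTorch sketch of the class-token idea described above: a learnable vector is prepended to the frame-level input sequence, passed through the MSA layers, and its output state is used as the basis for classification instead of global average pooling. The layer sizes, the classification heads, and the optional distillation token here are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class ClassTokenEncoder(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_layers=2, num_classes=100,
                 use_distillation_token=False):
        super().__init__()
        # Learnable class token, expanded to the batch at forward time.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.use_dist = use_distillation_token
        if use_distillation_token:
            # Extra token whose head is trained to mimic the teacher's
            # predictions (KD), while the class-token head targets true labels.
            self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.head = nn.Linear(dim, num_classes)           # true-label head
        if use_distillation_token:
            self.dist_head = nn.Linear(dim, num_classes)  # teacher-mimic head

    def forward(self, x):
        # x: (batch, time, dim) frame-level features.
        b = x.size(0)
        tokens = [self.cls_token.expand(b, -1, -1)]
        if self.use_dist:
            tokens.append(self.dist_token.expand(b, -1, -1))
        # Prepend the token(s) so self-attention can aggregate over all frames.
        x = torch.cat(tokens + [x], dim=1)
        x = self.encoder(x)
        cls_out = self.head(x[:, 0])   # class-token output state -> classes
        if self.use_dist:
            # Distillation-token state, matched against teacher predictions.
            return cls_out, self.dist_head(x[:, 1])
        return cls_out


# Usage: a batch of 8 utterances, 200 frames, 256-dim features.
logits = ClassTokenEncoder()(torch.randn(8, 200, 256))
```

Because the tokens attend to every frame position through self-attention, their output states can encode order-sensitive information, which is what makes this aggregation preferable to average pooling for text-dependent SV.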