The time-delay neural network (TDNN) is one of the state-of-the-art models for text-independent speaker verification. However, it is difficult for a conventional TDNN to capture the global context that many recent works have shown to be critical for robust speaker representations and long-duration speaker verification. Moreover, common solutions such as self-attention have quadratic complexity in the number of input tokens, which makes them computationally unaffordable when applied to the large feature maps in a TDNN. To address these issues, we propose the Global Filter for TDNN, which applies log-linear-complexity FFT/IFFT operations and a set of differentiable frequency-domain filters to efficiently model long-term dependencies in speech. In addition, a dynamic filtering strategy and a sparse regularization method are specially designed to enhance the performance of the global filter and prevent it from overfitting. Furthermore, we construct a dual-stream TDNN (DS-TDNN), which splits the basic channels to reduce complexity and employs the global filter to improve recognition performance. Experiments on the VoxCeleb and SITW databases show that the DS-TDNN achieves an approximately 10% improvement over the ECAPA-TDNN, with reductions of over 28% in complexity and 15% in parameters. It also offers the best trade-off between efficiency and effectiveness among popular baseline systems on long-duration speech. Finally, visualizations and a detailed ablation study further reveal the advantages of the DS-TDNN.
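The frequency-domain filtering idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name, the filter shape, and the identity-filter example are assumptions made for clarity, and in practice the filter weights would be learnable network parameters.

```python
import numpy as np

def global_filter(x, freq_filter):
    """Sketch of a frequency-domain global filter (illustrative only).

    x: (channels, time) real-valued feature map.
    freq_filter: (channels, time // 2 + 1) complex weights; in training
        these would be differentiable parameters, not fixed values.
    FFT -> elementwise multiply -> IFFT mixes every time step with every
    other in O(T log T) per channel, versus O(T^2) for self-attention.
    """
    spec = np.fft.rfft(x, axis=-1)            # to the frequency domain
    filtered = spec * freq_filter             # global mixing in one multiply
    return np.fft.irfft(filtered, n=x.shape[-1], axis=-1)

# Toy usage: an all-ones (identity) filter reconstructs the input exactly.
x = np.random.randn(4, 16)
identity = np.ones((4, 16 // 2 + 1), dtype=complex)
y = global_filter(x, identity)
```

Because the multiply happens in the frequency domain, a single elementwise product couples all time steps, which is what lets the filter model long-term dependencies at log-linear cost.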