The time-delay neural network (TDNN) is one of the state-of-the-art models for text-independent speaker verification. However, conventional TDNNs struggle to capture global context, which many recent works have shown to be critical for robust speaker representations and long-duration speaker verification. Moreover, common solutions such as self-attention have quadratic complexity in the number of input tokens, which makes them computationally unaffordable when applied to the large feature maps in a TDNN. To address these issues, we propose the Global Filter for TDNN, which applies log-linear-complexity FFT/IFFT and a set of differentiable frequency-domain filters to efficiently model long-term dependencies in speech. In addition, a dynamic filtering strategy and a sparse regularization method are specially designed to enhance the performance of the global filter and prevent it from overfitting. Furthermore, we construct a dual-stream TDNN (DS-TDNN), which splits the basic channels to reduce complexity and employs the global filter to increase recognition performance. Experiments on the VoxCeleb and SITW databases show that the DS-TDNN achieves an approximately 10% improvement while reducing complexity by over 28% and parameters by 15% compared with the ECAPA-TDNN. It also offers the best trade-off between efficiency and effectiveness among popular baseline systems on long-duration speech. Finally, visualizations and a detailed ablation study further reveal the advantages of the DS-TDNN.
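The core of the global filter is an element-wise multiplication in the frequency domain, which corresponds to a circular convolution over the full sequence and therefore gives a global receptive field at O(T log T) cost. The sketch below illustrates the forward pass with NumPy; the shapes, the time-axis-only filtering, and the identity-filter demo are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def global_filter(x, w):
    """Frequency-domain filtering along the time axis.

    x: (channels, frames) real-valued feature map
    w: (channels, frames // 2 + 1) complex filter weights
       (learnable in a real model; fixed here for illustration)
    """
    X = np.fft.rfft(x, axis=-1)                    # O(T log T) transform
    Y = X * w                                      # element-wise filtering: every output
                                                   # frame depends on every input frame
    return np.fft.irfft(Y, n=x.shape[-1], axis=-1)  # back to the time domain

# Toy example: an all-ones (identity) filter should recover the input.
rng = np.random.default_rng(0)
C, T = 4, 16
x = rng.standard_normal((C, T))
w = np.ones((C, T // 2 + 1), dtype=complex)
y = global_filter(x, w)
print(np.allclose(y, x))  # True
```

In a trainable layer, `w` would be a parameter tensor updated by backpropagation (the FFT is differentiable), and the sparse regularization mentioned above could be realized as an L1 penalty on `w`.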