The time delay neural network (TDNN) is one of the state-of-the-art neural solutions to text-independent speaker verification. However, it requires a large number of filters to capture speaker characteristics in any local frequency region. In addition, the performance of such systems may degrade in short-utterance scenarios. To address these issues, we propose multi-scale frequency-channel attention (MFA), which characterizes speakers at different scales through a novel dual-path design consisting of a convolutional neural network and a TDNN. We evaluate the proposed MFA on the VoxCeleb database and observe that the framework with MFA achieves state-of-the-art performance while reducing the number of parameters and the computational complexity. Further, the MFA mechanism is found to be effective for speaker verification with short test utterances.