The Time Delay Neural Network (TDNN) is a well-performing structure for DNN-based speaker recognition systems. In this paper we introduce a novel structure, the Crossed-Time Delay Neural Network (CTDNN), to improve on the performance of the current TDNN. Inspired by the multi-filter setting of the convolution layer in convolutional neural networks, we place multiple time delay units, each with a different context size, at the bottom layer and construct a multilayer parallel network. The proposed CTDNN gives significant improvements over the original TDNN on both speaker verification and speaker identification tasks. In the verification experiment on the VoxCeleb1 dataset, it outperforms the original TDNN with a 2.6% absolute Equal Error Rate improvement. Under the few-shot condition, CTDNN reaches 90.4% identification accuracy, which doubles the identification accuracy of the original TDNN. We also compare the proposed CTDNN with another recent variant of TDNN, FTDNN. The comparison shows that our model achieves a 36% absolute identification accuracy improvement under the few-shot condition and can train with a larger batch in a shorter training time, which makes better use of computational resources. The code of the new model is released at https://github.com/chenllliang/CTDNN
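As a rough illustration of the bottom-layer idea described above, the following pure-Python sketch runs several time delay units with different context sizes in parallel over the same frame sequence. It is not the released implementation: the context sizes, layer widths, ReLU activation, mean pooling over time, and concatenation of branch outputs are all assumptions made for the example, and every function name is hypothetical.

```python
import random


def time_delay_unit(frames, weights, bias):
    """Apply one time delay unit (a 1-D convolution over time).

    frames:  list of T feature vectors, each of length D
    weights: list of H filters, each of length context * D
    bias:    list of H floats
    Returns T - context + 1 output vectors of length H.
    """
    dim = len(frames[0])
    context = len(weights[0]) // dim
    outputs = []
    for t in range(len(frames) - context + 1):
        # Splice `context` consecutive frames into one flat vector.
        window = [x for frame in frames[t:t + context] for x in frame]
        outputs.append([
            max(0.0, sum(w * x for w, x in zip(filt, window)) + b)  # ReLU
            for filt, b in zip(weights, bias)
        ])
    return outputs


def mean_pool(sequence):
    """Average a variable-length sequence of vectors over time."""
    n = len(sequence)
    return [sum(column) / n for column in zip(*sequence)]


def init_unit(context, dim, hidden, rng):
    """Create random weights for one unit (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(context * dim)]
               for _ in range(hidden)]
    return weights, [0.0] * hidden


def crossed_bottom_layer(frames, units):
    """Run the parallel units and concatenate their pooled outputs.

    Pooling and concatenation are assumptions of this sketch, chosen so
    that branches with different context sizes yield a fixed-size vector.
    """
    embedding = []
    for weights, bias in units:
        embedding.extend(mean_pool(time_delay_unit(frames, weights, bias)))
    return embedding


rng = random.Random(0)
DIM, HIDDEN = 4, 3
CONTEXT_SIZES = (1, 3, 5)  # hypothetical values, not taken from the paper
units = [init_unit(c, DIM, HIDDEN, rng) for c in CONTEXT_SIZES]
frames = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(20)]

embedding = crossed_bottom_layer(frames, units)
print(len(embedding))  # 9, i.e. HIDDEN * len(CONTEXT_SIZES)
```

Each branch sees the same input but a different temporal span, which is the analogue of using several filter sizes in one convolution layer. A full model would stack further layers on each branch and train the weights; the sketch only shows how the parallel context sizes fit together.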