State-of-the-art speaker verification frameworks have typically improved verification performance by making models deeper (more layers) and wider (more channels). Instead, this paper proposes an approach that increases model resolution capability using attention-based dynamic kernels in a convolutional neural network, so that the model parameters become feature-conditioned. The attention weights on the kernels are further distilled by channel attention and multi-layer feature aggregation to learn global features from speech. Because the model parameters self-adapt to the input, this approach provides an efficient way to improve representation capacity with lower data resources. The proposed dynamic convolutional model achieved 1.62\% EER and 0.18 minDCF on the VoxCeleb1 test set, a 17\% relative improvement over ECAPA-TDNN trained with the same resources.
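The core mechanism described above can be illustrated with a minimal sketch: a tiny attention head pools the input features, produces softmax weights over a bank of K candidate kernels, and the weighted sum yields one input-specific kernel used for convolution. All names, shapes, and the pooling/attention design here are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_conv1d(x, kernels, attn_w, attn_b):
    """Attention-based dynamic 1-D convolution (illustrative sketch only).

    x        : (C_in, T) input feature map
    kernels  : (K, C_out, C_in, k) bank of K candidate kernels
    attn_w, attn_b : parameters of a hypothetical attention head that maps
                     the globally pooled input to K mixing logits
    """
    K, C_out, C_in, k = kernels.shape
    # Feature-conditioned attention over the kernel bank
    pooled = x.mean(axis=1)                         # global average pool -> (C_in,)
    weights = softmax(attn_w @ pooled + attn_b)     # (K,) mixing weights
    # Aggregate the bank into a single input-specific kernel
    W = np.tensordot(weights, kernels, axes=1)      # (C_out, C_in, k)
    # Plain valid 1-D convolution with the aggregated kernel
    T_out = x.shape[1] - k + 1
    y = np.zeros((C_out, T_out))
    for t in range(T_out):
        y[:, t] = np.tensordot(W, x[:, t:t + k], axes=([1, 2], [0, 1]))
    return y, weights
```

Note that the attention weights, and hence the effective kernel, change with every input utterance; this is what makes the parameters feature-conditioned rather than static.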