The ResNet-based architecture has been widely adopted to extract speaker embeddings for text-independent speaker verification systems. By introducing residual connections to the CNN and standardizing the residual blocks, the ResNet structure makes it possible to train deep networks that achieve highly competitive recognition performance. However, when the input feature space becomes more complicated, simply increasing the depth and width of the ResNet may not fully realize its performance potential. In this paper, we present two extensions of the ResNet architecture, ResNeXt and Res2Net, for speaker verification. Originally proposed for image recognition, ResNeXt and Res2Net introduce two more dimensions, cardinality and scale, in addition to depth and width, to improve the model's representation capacity. By increasing the scale dimension, the Res2Net model can represent multi-scale features with various granularities, which particularly facilitates speaker verification for short utterances. We evaluate our proposed systems on three speaker verification tasks. Experiments on the VoxCeleb test set showed that ResNeXt and Res2Net significantly outperform the conventional ResNet model. The Res2Net model achieved superior performance, with a relative reduction in equal error rate (EER) of 18.5%. Experiments on the other two internal test sets with mismatched conditions further confirmed that the ResNeXt and Res2Net architectures generalize to noisy environments and segment length variations.
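To make the "scale" dimension concrete, the following is a minimal sketch of the hierarchical split-and-transform step as it is commonly described for Res2Net in image recognition: the channels are split into s groups, the first group passes through unchanged, and each later group is filtered after adding the previous group's output. The abstract does not give these equations, so the recurrence here is an assumption about the block's structure. The 3-tap average `smooth3` is a hypothetical stand-in for a learned 3x3 convolution, and all function names are illustrative.

```python
from typing import List


def smooth3(x: List[float]) -> List[float]:
    """Stand-in for a learned 3x3 convolution: 3-tap average, zero padded."""
    padded = [0.0] + x + [0.0]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
            for i in range(len(x))]


def res2net_split(groups: List[List[float]]) -> List[List[float]]:
    """Assumed recurrence: y1 = x1, y2 = K(x2), yi = K(xi + y_{i-1})."""
    outputs: List[List[float]] = []
    for i, x in enumerate(groups):
        if i == 0:
            y = list(x)                      # first group: identity
        elif i == 1:
            y = smooth3(x)                   # second group: one filter
        else:
            mixed = [a + b for a, b in zip(x, outputs[-1])]
            y = smooth3(mixed)               # later groups reuse y_{i-1}
        outputs.append(y)
    return outputs


def support_widths(scale: int, length: int = 9) -> List[int]:
    """Feed a unit impulse to every group; count nonzero output positions."""
    impulse = [0.0] * length
    impulse[length // 2] = 1.0
    groups = [list(impulse) for _ in range(scale)]
    return [sum(1 for v in y if v != 0.0) for y in res2net_split(groups)]


if __name__ == "__main__":
    # Each successive group sees a wider context of the input.
    print(support_widths(scale=4))
```

With a scale of 4, the four groups respond to progressively wider neighbourhoods of the input, which is one way to read "multi-scale features with various granularities" within a single block.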