In this paper, we present the Multi-scale Feature Aggregation Conformer (MFA-Conformer), an easy-to-implement, simple but effective backbone for automatic speaker verification based on the Convolution-augmented Transformer (Conformer). The architecture of the MFA-Conformer is inspired by recent state-of-the-art models in speech recognition and speaker verification. First, we introduce a convolution subsampling layer to decrease the computational cost of the model. Second, we adopt Conformer blocks, which combine Transformers and convolutional neural networks (CNNs), to capture global and local features effectively. Finally, the output feature maps from all Conformer blocks are concatenated to aggregate multi-scale representations before final pooling. We evaluate the MFA-Conformer on widely used benchmarks. The best system obtains 0.64%, 1.29% and 1.63% EER on the VoxCeleb1-O, SITW.Dev, and SITW.Eval sets, respectively. The MFA-Conformer significantly outperforms the popular ECAPA-TDNN systems in both recognition performance and inference speed. Last but not least, ablation studies clearly demonstrate that combining global and local feature learning leads to robust and accurate speaker embedding extraction. We have also released the code for future comparison.
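To make the described pipeline concrete, below is a minimal PyTorch sketch of the backbone's data flow: convolution subsampling, a stack of Conformer-style blocks, and concatenation of every block's output before pooling. The module `SimpleConformerBlock`, all dimensions, and the plain mean-and-std pooling are illustrative assumptions for this sketch, not the released implementation.

```python
# Sketch of the MFA-Conformer backbone idea (illustrative, not the released code).
import torch
import torch.nn as nn

class SimpleConformerBlock(nn.Module):
    """Simplified Conformer block: self-attention (global) + depthwise conv (local).
    A real Conformer block also includes macaron feed-forward modules."""
    def __init__(self, dim, heads=4, kernel_size=15):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x):                        # x: (batch, time, dim)
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a                                # global context via attention
        c = self.conv(self.norm2(x).transpose(1, 2)).transpose(1, 2)
        return x + c                             # local context via depthwise conv

class MFABackbone(nn.Module):
    def __init__(self, n_mels=80, dim=256, num_blocks=6):
        super().__init__()
        # Convolution subsampling: 4x time reduction to cut compute.
        self.subsample = nn.Sequential(
            nn.Conv2d(1, dim, 3, stride=2), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2), nn.ReLU(),
        )
        freq_out = ((n_mels - 1) // 2 - 1) // 2  # freq bins after two stride-2 convs
        self.proj = nn.Linear(dim * freq_out, dim)
        self.blocks = nn.ModuleList(
            SimpleConformerBlock(dim) for _ in range(num_blocks))

    def forward(self, mel):                      # mel: (batch, n_mels, time)
        x = self.subsample(mel.unsqueeze(1))     # (batch, dim, freq', time')
        b, d, f, t = x.shape
        x = self.proj(x.permute(0, 3, 1, 2).reshape(b, t, d * f))
        outs = []
        for block in self.blocks:                # keep every block's output
            x = block(x)
            outs.append(x)
        h = torch.cat(outs, dim=-1)              # multi-scale aggregation
        # Mean + std pooling -> fixed-length utterance-level representation.
        return torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)

emb = MFABackbone()(torch.randn(2, 80, 200))     # -> (2, 3072)
```

With `num_blocks=6` and `dim=256`, the pooled vector is 3072-dimensional; in practice, an attentive pooling layer and a final linear projection would typically map this to the speaker embedding size.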