This paper describes the speaker verification (SV) systems submitted by the SpeakIn team to Task 1 and Task 2 of the Far-Field Speaker Verification Challenge 2022 (FFSVC2022). The two tasks focus on fully supervised far-field speaker verification (Task 1) and semi-supervised far-field speaker verification (Task 2). For Task 1, we used the VoxCeleb and FFSVC2020 datasets as training data; for Task 2, we used only the VoxCeleb dataset. ResNet-based and RepVGG-based architectures were developed for this challenge. Global statistics pooling and multi-query multi-head attention (MQMHA) pooling were used to aggregate frame-level features across time into utterance-level representations. We adopted AM-Softmax and AAM-Softmax losses to classify the resulting embeddings. We also propose a staged transfer learning method: in the pre-training stage, the speaker weights are reserved but receive no positive samples; these weights are then fine-tuned with both positive and negative samples in the second stage. Compared with the traditional transfer learning strategy, this approach further improves model performance. The Sub-Mean and AS-Norm backend methods were used to mitigate domain mismatch. In the fusion stage, three models were fused for Task 1 and two for Task 2. On the FFSVC2022 leaderboard, our submission achieves an EER of 3.0049% and a minDCF of 0.2938 in Task 1; in Task 2, the EER and minDCF are 6.2060% and 0.5232, respectively. Our approach leads to excellent performance and ranks 1st in both challenge tasks.
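For readers unfamiliar with the margin-based classification losses mentioned above, the following is a minimal NumPy sketch of the AAM-Softmax (additive angular margin) logit computation in its standard published form; the margin and scale values are illustrative defaults, not the exact hyperparameters used in our systems.

```python
import numpy as np

def aam_softmax_logits(embeddings, weights, labels, margin=0.2, scale=30.0):
    """Standard AAM-Softmax logits (illustrative sketch).

    embeddings: (N, D) utterance-level embeddings.
    weights:    (C, D) speaker class weight vectors.
    labels:     (N,) integer speaker labels.

    Cosine similarities are computed on L2-normalized vectors; the angular
    margin is added to the target-class angle before rescaling, which makes
    the target logit harder to satisfy and tightens class boundaries.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T                                   # (N, C) cosine scores
    theta = np.arccos(np.clip(cos, -1.0, 1.0))      # angles in [0, pi]
    idx = np.arange(len(labels))
    logits = cos.copy()
    # Add the margin only to the target class, clipping the angle at pi.
    logits[idx, labels] = np.cos(np.minimum(theta[idx, labels] + margin, np.pi))
    return scale * logits
```

AM-Softmax differs only in applying the margin additively in cosine space (`cos - m` on the target class) rather than in angle space.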