This technical report describes the Johns Hopkins University speaker recognition system submitted to the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21) Track 3: Self-supervised speaker verification (closed). Our overall training process is similar to that of the first-place team in last year's VoxSRC 2020 challenge. The main difference is that a recently proposed non-contrastive self-supervised method from computer vision (CV), DIstillation with NO labels (DINO), is used to train our initial model, outperforming last year's contrastive learning based on momentum contrast (MoCo). This also requires only a few iterations in the iterative clustering stage, where pseudo labels for supervised embedding learning are updated based on clusters of the embeddings generated by a model that is continually fine-tuned over the iterations. In the final stage, a Res2Net50 is trained on the final pseudo labels from the iterative clustering stage. This is our best submitted model in the challenge, achieving EERs (%) of 1.89, 6.50, and 6.89 on the VoxCeleb1 test-o, VoxSRC-21 validation, and test trials, respectively.
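The iterative clustering stage described above alternates between clustering embeddings into pseudo labels and fine-tuning the embedding model on those labels. The following is a minimal sketch of that loop; the k-means implementation, the `extract_embeddings`/`train_model` callables, and all parameter names are illustrative assumptions, not the report's actual code (the report does not specify the clustering algorithm used here).

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Toy k-means: returns a pseudo-label (cluster index) per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest center.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Recompute each center as the mean of its assigned embeddings.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def iterative_clustering(extract_embeddings, train_model, data, k, rounds=3):
    """Alternate clustering and supervised fine-tuning on pseudo labels.

    extract_embeddings(model, data) -> embeddings (model=None on round 1,
    i.e. the DINO-initialized model in the report's setting);
    train_model(data, pseudo_labels) -> fine-tuned model.
    Both are hypothetical hooks standing in for the real training code.
    """
    model = None
    pseudo_labels = None
    for _ in range(rounds):
        emb = extract_embeddings(model, data)     # embeddings from current model
        pseudo_labels = kmeans(emb, k)            # cluster -> updated pseudo labels
        model = train_model(data, pseudo_labels)  # supervised fine-tuning step
    return model, pseudo_labels
```

The report notes that starting from the DINO model, only a few such rounds are needed before the pseudo labels stabilize enough to train the final Res2Net50.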