This paper presents the SJTU system for both the text-dependent and text-independent tasks of the Short-duration Speaker Verification (SdSV) Challenge 2021. In this challenge, we explored different strong embedding extractors to obtain robust speaker embeddings. For the text-independent task, language-dependent adaptive s-norm is explored to improve system performance under the cross-lingual verification condition. For the text-dependent task, we mainly focus on in-domain fine-tuning strategies for models pre-trained on large-scale out-of-domain data. To improve the discrimination between different speakers uttering the same phrase, we propose several novel phrase-aware fine-tuning strategies and a phrase-aware neural PLDA, which further improve system performance. Finally, we fused the scores of the different systems; our fusion systems achieved 0.0473 on Task 1 (rank 3) and 0.0581 on Task 2 (rank 8) on the primary evaluation metric.
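To make the scoring step concrete, the following is a minimal sketch of adaptive symmetric score normalization (AS-norm) over cosine-scored embeddings. The language-dependent variant mentioned above would select the cohort per trial (e.g., matching the language of the test utterance); the function and parameter names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cosine(vec, matrix):
    # Cosine similarity between one embedding and a matrix of cohort embeddings.
    vec = vec / np.linalg.norm(vec)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ vec

def adaptive_snorm(score, enroll_emb, test_emb, cohort_emb, top_k=300):
    """Adaptive s-norm of a single trial score.

    `cohort_emb` is assumed to be chosen per trial, e.g. a cohort whose
    language matches the test utterance (the language-dependent part).
    """
    # Top-k cohort scores against the enrollment and test embeddings.
    e_scores = np.sort(cosine(enroll_emb, cohort_emb))[::-1][:top_k]
    t_scores = np.sort(cosine(test_emb, cohort_emb))[::-1][:top_k]
    # Symmetric normalization with enrollment- and test-side statistics.
    return 0.5 * ((score - e_scores.mean()) / e_scores.std()
                  + (score - t_scores.mean()) / t_scores.std())
```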