This paper contains a post-challenge performance analysis on cross-lingual speaker verification of the IDLab submission to the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21). We show that current speaker embedding extractors consistently underestimate speaker similarity in within-speaker cross-lingual trials. Consequently, the typical training and scoring protocols do not put enough emphasis on the compensation of intra-speaker language variability. We propose two techniques to increase cross-lingual speaker verification robustness. First, we enhance our previously proposed Large-Margin Fine-Tuning (LM-FT) training stage with a mini-batch sampling strategy that increases the number of intra-speaker cross-lingual samples within the mini-batch. Second, we incorporate language information in the logistic regression calibration stage. We integrate quality metrics based on soft and hard decisions of a VoxLingua107 language identification model. The proposed techniques resulted in an 11.7% relative improvement over the baseline model on the VoxSRC-21 test set and contributed to our third-place finish in the corresponding challenge.
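The first technique, a cross-lingual mini-batch sampling strategy, can be illustrated with a minimal sketch. The data layout (`(utterance_id, speaker_id, language)` tuples), the pairing heuristic, and all parameter names below are illustrative assumptions, not the exact sampler used in the IDLab system; the idea is simply to favor same-speaker utterance pairs drawn from two different languages when composing a mini-batch.

```python
import random
from collections import defaultdict

def cross_lingual_batches(utterances, batch_size=32, seed=0):
    """Yield mini-batches biased toward intra-speaker cross-lingual pairs.

    `utterances`: iterable of (utt_id, speaker_id, language) tuples.
    Hypothetical sketch: speakers with recordings in at least two
    languages contribute one utterance per language to each pair.
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(lambda: defaultdict(list))
    for utt, spk, lang in utterances:
        by_speaker[spk][lang].append(utt)

    # Only speakers with >= 2 languages can form cross-lingual pairs.
    multilingual = [s for s, langs in by_speaker.items() if len(langs) >= 2]
    rng.shuffle(multilingual)

    batch = []
    for spk in multilingual:
        lang_a, lang_b = rng.sample(sorted(by_speaker[spk]), 2)
        # Same speaker, two different languages -> one cross-lingual pair.
        batch.append((rng.choice(by_speaker[spk][lang_a]), spk))
        batch.append((rng.choice(by_speaker[spk][lang_b]), spk))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

A real sampler would also mix in monolingual speakers and cycle over epochs; this sketch only shows the pairing step that raises the share of intra-speaker cross-lingual comparisons seen by the margin-based loss.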
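The second technique, language-aware calibration, can likewise be sketched. Below is a minimal logistic-regression calibration that augments the raw verification score with a hard cross-lingual indicator (e.g. derived from a language identification model's top decision on both sides of a trial). The feature set, the plain gradient-descent fit, and all hyperparameters are assumptions for illustration only.

```python
import numpy as np

def train_calibration(scores, same_lang, labels, lr=0.5, epochs=500):
    """Fit w0 + w1*score + w2*cross_lingual by logistic regression.

    scores:    raw verification scores (1-D array)
    same_lang: 1.0 if both trial sides share a predicted language, else 0.0
    labels:    1.0 for target trials, 0.0 for non-target trials
    Illustrative sketch; a production system would use a calibration
    toolkit and possibly soft language-posterior features as well.
    """
    X = np.column_stack([np.ones_like(scores), scores, 1.0 - same_lang])
    w = np.zeros(3)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # sigmoid
        w -= lr * X.T @ (p - labels) / len(labels)
    return w

def calibrate(w, score, same_lang):
    """Map a raw score and a language flag to a calibrated score."""
    x = np.array([1.0, score, 1.0 - same_lang])
    return float(x @ w)
```

Because the cross-lingual indicator enters as a feature, the calibration can shift scores of cross-lingual trials relative to monolingual ones, compensating the systematic underestimation of speaker similarity described above.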