The goal of this paper is to train speaker embeddings that are robust in bilingual speaking scenarios. The majority of the world's population speak at least two languages; however, most speaker recognition systems fail to recognise the same speaker when they speak in different languages. Popular speaker recognition evaluation sets do not consider the bilingual scenario, making it difficult to analyse the effect of bilingual speakers on speaker recognition performance. This paper proposes a new large-scale evaluation set derived from VoxCeleb that covers bilingual scenarios. We also introduce a representation learning strategy that disentangles language information from the speaker representation to account for the bilingual scenario. This language-disentangled representation learning strategy can be adapted to existing models with small changes to the training pipeline. Experimental results demonstrate that baseline models suffer significant performance degradation when evaluated on the proposed bilingual test set. In contrast, the model trained with the proposed disentanglement strategy shows significant improvement under the bilingual evaluation scenario while retaining competitive performance on existing monolingual test sets.
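The abstract does not specify how the language disentanglement is implemented. As a rough, hedged illustration only, the sketch below shows one common way such a strategy can be bolted onto an existing speaker encoder with minimal changes to the training pipeline: adversarial training with a gradient reversal layer, where the embedding is trained to classify speakers while a reversed-gradient language classifier pushes language cues out of the embedding. The names `DisentangledSpeakerModel`, `GradReverse`, and the weight `lambd` are assumptions for this sketch, not the paper's actual design.

```python
# Illustrative sketch (NOT the paper's exact method): adversarial language
# disentanglement via a gradient reversal layer (GRL) on top of any existing
# speaker encoder.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses and scales gradients backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows into the encoder; no gradient for lambd.
        return -ctx.lambd * grad_output, None


class DisentangledSpeakerModel(nn.Module):
    def __init__(self, encoder, emb_dim, num_speakers, num_languages, lambd=1.0):
        super().__init__()
        self.encoder = encoder                      # any existing speaker encoder
        self.speaker_head = nn.Linear(emb_dim, num_speakers)
        self.language_head = nn.Linear(emb_dim, num_languages)
        self.lambd = lambd

    def forward(self, x):
        emb = self.encoder(x)                       # speaker embedding
        spk_logits = self.speaker_head(emb)
        # The language classifier sees the embedding through the GRL, so
        # minimising its loss drives the encoder to remove language cues.
        lang_logits = self.language_head(GradReverse.apply(emb, self.lambd))
        return emb, spk_logits, lang_logits


def training_step(model, batch, criterion=nn.CrossEntropyLoss()):
    """Total loss = speaker classification loss + adversarial language loss."""
    x, spk_labels, lang_labels = batch
    _, spk_logits, lang_logits = model(x)
    return criterion(spk_logits, spk_labels) + criterion(lang_logits, lang_labels)
```

Because the only additions are an extra classification head and a loss term, this kind of objective can, in principle, be dropped into an existing speaker recognition training loop without changing the encoder architecture, which is consistent with the abstract's claim of "small changes to the training pipeline".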