In this paper, we extend previous self-supervised approaches for language identification by experimenting with a Conformer-based architecture in a multilingual pre-training paradigm. We find that pre-trained speech models optimally encode language-discriminative information in their lower layers. Further, we demonstrate that embeddings obtained from these layers are significantly more robust for classifying unseen languages and different acoustic environments without additional training. After fine-tuning a pre-trained Conformer model on the VoxLingua107 dataset, we achieve results comparable to current state-of-the-art systems for language identification. Moreover, our model accomplishes this with 5x fewer parameters. We open-source the model through the NVIDIA NeMo toolkit.