In this paper, we introduce a novel language identification system based on conformer layers. We propose an attentive temporal pooling mechanism that allows the model to carry information across long-form audio in a recurrent form, so that inference can be performed in a streaming fashion. Additionally, we investigate two domain adaptation approaches that adapt an existing language identification model to a new domain without retraining its parameters. We perform a comparative study of different model topologies under different model size constraints, and find that conformer-based models significantly outperform LSTM- and transformer-based models. Our experiments also show that attentive temporal pooling and domain adaptation improve model accuracy.
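As a minimal sketch of the streaming idea behind attentive temporal pooling (not the paper's exact formulation), one can maintain a running attention-weighted sum and a running weight total across chunks, so the pooled embedding is available after every chunk; the names `attention_logits`, `v`, and the shapes below are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def attention_logits(frames, v):
    """Per-frame attention scores; `v` stands in for learned attention parameters."""
    return frames @ v  # shape (T,)

def stream_attentive_pooling(chunks, v):
    """Attentive temporal pooling in a recurrent / streaming form.

    Rather than attending over the whole utterance at once, keep a running
    weighted sum (numerator) and running weight total (denominator), updated
    chunk by chunk. The pooled embedding at any point is numerator / denominator,
    so a prediction can be emitted after each chunk without future audio.
    """
    d = v.shape[0]
    numerator = np.zeros(d)   # carried across chunks: sum_t w_t * h_t
    denominator = 0.0         # carried across chunks: sum_t w_t
    for frames in chunks:     # frames: (T_chunk, d) encoder outputs
        w = np.exp(attention_logits(frames, v))  # unnormalized attention weights
        numerator += w @ frames
        denominator += w.sum()
        yield numerator / denominator            # current pooled embedding

# Toy usage: two chunks of encoder outputs (random stand-ins).
rng = np.random.default_rng(0)
v = rng.normal(size=16)
chunks = [rng.normal(size=(50, 16)), rng.normal(size=(80, 16))]
for emb in stream_attentive_pooling(chunks, v):
    print(emb.shape)  # (16,) after each chunk
```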