End-to-end speech-to-intent classification has shown its advantage in harvesting information from both text and speech. In this paper, we study a technique to develop such an end-to-end system that supports multiple languages. To overcome the scarcity of multi-lingual speech corpora, we exploit knowledge from a pre-trained multi-lingual natural language processing model. Multi-lingual bidirectional encoder representations from transformers (mBERT) models are trained on multiple languages and are hence expected to perform well in multi-lingual scenarios. In this work, we employ a teacher-student learning approach to sufficiently extract information from an mBERT model to train a multi-lingual speech model. In particular, we use synthesized speech generated from an English-Mandarin text corpus for the analysis and training of a multi-lingual intent classification model. We also demonstrate that the teacher-student learning approach achieves improved accuracy (91.02%) over the traditional end-to-end intent classification approach (89.40%) in a practical multi-lingual scenario.
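The teacher-student approach described above can be sketched as a combined objective: the speech student is trained both to match the text teacher's (mBERT) utterance embedding and to predict the intent label. This is a minimal illustrative sketch, not the paper's actual implementation; all shapes, the MSE matching term, and the `alpha` weighting are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_emb, teacher_emb, student_logits, labels, alpha=0.5):
    """Hypothetical teacher-student objective: MSE between the speech
    student's embedding and the text teacher's embedding, combined with
    cross-entropy on the intent labels; alpha balances the two terms."""
    mse = np.mean((student_emb - teacher_emb) ** 2)
    probs = softmax(student_logits)
    n = labels.shape[0]
    ce = -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
    return alpha * mse + (1.0 - alpha) * ce

# Toy batch: 4 utterances, 8-dim embeddings, 3 intent classes.
teacher_emb = rng.normal(size=(4, 8))   # from the text teacher (e.g. mBERT)
student_emb = rng.normal(size=(4, 8))   # from the speech student
student_logits = rng.normal(size=(4, 3))
labels = np.array([0, 2, 1, 0])

loss = distill_loss(student_emb, teacher_emb, student_logits, labels)
```

In practice the teacher's embeddings would come from running mBERT on the transcript of each (possibly synthesized) utterance, while the student consumes the audio directly, so at inference time no text is needed.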