We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistilBERT (42M params) by 4.23% and 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
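For context on the distillation step described above, the following is a minimal sketch of a standard soft-target knowledge distillation objective of the kind commonly used to compress a large pretrained encoder into a smaller student. The temperature, loss weighting, and function names are illustrative assumptions, not the authors' exact training recipe.

```python
# Minimal sketch of soft-target knowledge distillation (generic recipe,
# not the paper's exact configuration; temperature and alpha are assumptions).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend teacher soft targets (KL divergence) with hard-label cross-entropy."""
    # Soft targets: match the student's tempered distribution to the teacher's.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against gold labels
    # (e.g., masked tokens during pretraining, or intent labels downstream).
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In a typical setup the temperature and alpha are tuned on held-out data, and the same blended objective can be applied either to masked-language-model logits during distillation-style pretraining or to task logits (intent classification, slot filling) during fine-tuning.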