We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistilBERT (42M params) by 4.23% and 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
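For context on the distillation step described above, the following is a minimal sketch of a standard soft-target knowledge distillation objective of the kind commonly used to compress a large pretrained encoder into a smaller student. The temperature, loss weighting, and function names are illustrative assumptions, not the authors' exact training recipe.

```python
# Minimal sketch of soft-target knowledge distillation (generic recipe,
# not the paper's exact configuration; temperature and alpha are assumptions).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend teacher soft targets (KL divergence) with hard-label cross-entropy."""
    # Soft targets: match the student's tempered distribution to the teacher's.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against gold labels
    # (e.g., masked tokens during pretraining, or intent labels downstream).
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In a typical setup the temperature and alpha are tuned on held-out data, and the same blended objective can be applied either to masked-language-model logits during distillation-style pretraining or to task logits (intent classification, slot filling) during fine-tuning.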