NewsBERT: Distilling Pre-trained Language Model for Intelligent News Application

Pre-trained language models (PLMs) like BERT have made great progress in NLP. News articles usually contain rich textual information, and PLMs have the potential to enhance news text modeling for various intelligent news applications such as news recommendation and retrieval. However, most existing PLMs are very large, with hundreds of millions of parameters. Many online news applications need to serve millions of users with low latency tolerance, which makes it very challenging to incorporate PLMs in these scenarios. Knowledge distillation techniques can compress a large PLM into a much smaller one while maintaining good performance. However, existing language models are pre-trained and distilled on general corpora like Wikipedia, which differ from the news domain, so the resulting models may be suboptimal for news intelligence. In this paper, we propose NewsBERT, which can distill PLMs for efficient and effective news intelligence. In our approach, we design a teacher-student joint learning and distillation framework to collaboratively learn both teacher and student models, where the student model can learn from the learning experience of the teacher model. In addition, we propose a momentum distillation method that incorporates the gradients of the teacher model into the update of the student model to better transfer useful knowledge learned by the teacher model. Extensive experiments on two real-world datasets with three tasks show that NewsBERT can effectively improve model performance in various intelligent news applications with much smaller models.
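
To make the idea of teacher-student joint learning more concrete, the following is a minimal sketch of a generic soft-label distillation loss combined with the task losses of both models. It is not the objective used in NewsBERT, which the abstract does not specify. The function names, the temperature, and the weight `alpha` are all assumptions chosen for illustration, and the momentum distillation step is left out because the abstract does not say how teacher gradients are mapped onto student parameters.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy H(target, predicted) between two distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Generic soft-label loss: the student matches the teacher's softened output."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return cross_entropy(teacher_probs, student_probs)


def joint_loss(teacher_logits, student_logits, label, temperature=2.0, alpha=0.5):
    """Hypothetical joint objective: both task losses plus a weighted distillation term.

    `alpha` and the simple additive form are illustrative assumptions,
    not values or a formula taken from the paper.
    """
    one_hot = [1.0 if i == label else 0.0 for i in range(len(teacher_logits))]
    teacher_task = cross_entropy(one_hot, softmax(teacher_logits))
    student_task = cross_entropy(one_hot, softmax(student_logits))
    distill = distillation_loss(teacher_logits, student_logits, temperature)
    return teacher_task + student_task + alpha * distill
```

In this sketch the teacher and student are optimized together on the same batch, so the student sees the teacher's predictions as they evolve during training rather than only the outputs of a frozen, fully trained teacher.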
