A recent trend in Natural Language Processing is the exponential growth in Language Model (LM) size, which prevents research groups without the necessary hardware infrastructure from participating in the development process. This study investigates Knowledge Distillation (KD) methods that provide efficient alternatives to large-scale models. In this context, KD means extracting the information about language that is encoded in a Neural Network or in Lexical Knowledge Databases. We developed two methods to test our hypothesis that efficient architectures can gain knowledge from LMs and extract valuable information from lexical sources. First, we present a technique for learning confident probability distributions for Masked Language Modeling by weighting the predictions of multiple teacher networks. Second, we propose a method for Word Sense Disambiguation (WSD) and lexical KD that is general enough to be adapted to many LMs. Our results show that KD with multiple teachers leads to improved training convergence. With our lexical pre-training method, LM characteristics are preserved, yielding increased performance on Natural Language Understanding (NLU) tasks over the state of the art while adding no parameters. Moreover, the improved semantic understanding of our model increases task performance beyond WSD and NLU in a real-world scenario (Plagiarism Detection). This study suggests that sophisticated training methods and network architectures can be superior to merely scaling the number of trainable parameters. On this basis, we suggest that the research community should encourage the development and use of efficient models, and should weigh the impacts of growing LM size equally against task performance.
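As an illustration of the multi-teacher prediction weighting described above, the following is a minimal sketch, not the study's exact implementation: it assumes PyTorch, a hypothetical confidence-derived weight vector per teacher, and the standard temperature-scaled KL distillation loss; the function names `multi_teacher_mlm_targets` and `distillation_loss` are illustrative.

```python
import torch
import torch.nn.functional as F

def multi_teacher_mlm_targets(teacher_logits, teacher_weights, temperature=2.0):
    """Combine the MLM predictions of several teachers into one soft target.

    teacher_logits: list of tensors, each of shape [batch, seq_len, vocab_size]
    teacher_weights: tensor of shape [num_teachers], e.g. derived from teacher confidence
    """
    weights = F.softmax(teacher_weights, dim=0)  # normalize weights to a distribution
    probs = torch.stack(                         # [num_teachers, batch, seq_len, vocab_size]
        [F.softmax(logits / temperature, dim=-1) for logits in teacher_logits]
    )
    # Weighted average over the teachers' probability distributions
    return torch.einsum("t,tblv->blv", weights, probs)

def distillation_loss(student_logits, soft_targets, temperature=2.0):
    """KL divergence between the student's MLM distribution and the combined target."""
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
```

The temperature softens both teacher and student distributions so that the student also learns from the teachers' relative ranking of unlikely tokens; the factor of `temperature ** 2` keeps the gradient magnitude comparable to a hard-label loss, following standard distillation practice.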