Pre-trained language models, despite their rapid advancements powered by scale, still fall short of robust commonsense capabilities. And yet, scale appears to be the winning recipe; after all, the largest models seem to have acquired the largest amount of commonsense capabilities. Or is it? In this paper, we investigate the possibility of a seemingly impossible match: can smaller language models with dismal commonsense capabilities (i.e., GPT-2) ever win over models that are orders of magnitude larger and better (i.e., GPT-3), if the smaller models are powered with novel commonsense distillation algorithms? The key intellectual question we ask here is whether it is possible, if at all, to design a learning algorithm that does not benefit from scale, yet leads to a competitive level of commonsense acquisition. In this work, we study generative models of commonsense knowledge, focusing on the task of generating generics, statements of commonsense facts about everyday concepts, e.g., birds can fly. We introduce a novel commonsense distillation framework, I2D2, that loosely follows the Symbolic Knowledge Distillation of West et al. but breaks the dependence on extreme-scale teacher models through two innovations: (1) a novel adaptation of NeuroLogic Decoding to enhance the generation quality of weak, off-the-shelf language models, and (2) self-imitation learning, whereby the model iteratively learns from its own enhanced commonsense generations. Empirical results suggest that scale is not the only way, as novel algorithms can be a promising alternative. Moreover, our study leads to a new corpus of generics, Gen-A-Tomic, that is the largest and of the highest quality available to date.