Pre-trained language models, despite their rapid advancements powered by scale, still fall short of robust commonsense capabilities. And yet, scale appears to be the winning recipe; after all, the largest models seem to have acquired the largest amount of commonsense capabilities. Or is it? In this paper, we investigate the possibility of a seemingly impossible match: can smaller language models with dismal commonsense capabilities (i.e., GPT-2) ever win over models that are orders of magnitude larger and better (i.e., GPT-3), if the smaller models are powered with novel commonsense distillation algorithms? The key intellectual question we ask here is whether it is possible, if at all, to design a learning algorithm that does not benefit from scale, yet leads to a competitive level of commonsense acquisition. In this work, we study generative models of commonsense knowledge, focusing on the task of generating generics, statements of commonsense facts about everyday concepts, e.g., birds can fly. We introduce a novel commonsense distillation framework, I2D2, that loosely follows the Symbolic Knowledge Distillation of West et al. but breaks the dependence on extreme-scale teacher models through two innovations: (1) a novel adaptation of NeuroLogic Decoding to enhance the generation quality of weak, off-the-shelf language models, and (2) self-imitation learning, whereby the model iteratively learns from its own enhanced commonsense generations. Empirical results suggest that scale is not the only way, as novel algorithms can be a promising alternative. Moreover, our study leads to a new corpus of generics, Gen-A-Tomic, that is the largest and of the highest quality available to date.