We study how to enhance language models (LMs) with textual commonsense knowledge. Previous work (e.g., KnowBERT) has focused on integrating entity knowledge from knowledge graphs. To introduce external entity embeddings, these methods learn to jointly represent the original sentences and the external knowledge by pre-training on a large-scale corpus. When switching to textual commonsense, however, encoding commonsense descriptions is heavy, unlike the lightweight entity embeddings, so pre-training to jointly represent the target sentence and external commonsense descriptions is unaffordable. On the other hand, since pre-trained LMs that represent the target sentences alone are readily available, is it feasible to introduce commonsense knowledge in downstream tasks by fine-tuning them only? In this paper, we propose a plug-and-play method for large-scale commonsense integration without pre-training. Our method is inspired by the observation that, in regular fine-tuning for downstream tasks where no external knowledge is introduced, the parameters of the language model vary only slightly. Our method starts from a pre-trained LM that represents the target sentences only (e.g., BERT). We argue that pre-training for joint representation learning can be avoided if learning the joint representation only slightly perturbs the parameters of the starting LM. Previous methods such as KnowBERT make complex modifications to the vanilla LM to introduce external knowledge. Our model, COOK-Transformer (COmmOnsense Knowledge-enhanced Transformer), in contrast, hardly changes the vanilla LM beyond adding a knowledge token to each Transformer layer. In a variety of experiments, COOK-Transformer improves the performance of BERT/RoBERTa without any pre-training.
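To make the knowledge-token idea concrete, the following is a minimal sketch assuming a PyTorch backbone; the class name KnowledgeAugmentedLayer, the dimensions, and the use of a learned placeholder token are illustrative assumptions, not the paper's implementation, which would instead attach an encoded commonsense description to each layer of a pre-trained encoder such as BERT.

```python
import torch
import torch.nn as nn


class KnowledgeAugmentedLayer(nn.Module):
    """One Transformer encoder layer whose input is prepended with a
    layer-specific knowledge token (a hypothetical reading of the
    COOK-Transformer idea; names and sizes are illustrative)."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        # Learned placeholder standing in for an encoded commonsense
        # description; in the paper this slot would be filled from an
        # external commonsense source, not learned from scratch.
        self.knowledge_token = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        batch = hidden.size(0)
        know = self.knowledge_token.expand(batch, -1, -1)
        # Prepend the knowledge token so every word position can attend to it.
        augmented = torch.cat([know, hidden], dim=1)
        out = self.layer(augmented)
        # Drop the knowledge position; only word representations are passed
        # on, so the backbone's layer interface is left unchanged.
        return out[:, 1:, :]


if __name__ == "__main__":
    x = torch.randn(2, 16, 768)      # (batch, seq_len, d_model)
    block = KnowledgeAugmentedLayer()
    print(block(x).shape)            # torch.Size([2, 16, 768])
```

Because the extra token is stripped before the output is returned, such a layer can be dropped into a pre-trained encoder and fine-tuned directly, which is consistent with the plug-and-play, no-pre-training claim above.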