How much knowledge do pretrained language models hold? Recent research has observed that pretrained transformers are adept at modeling semantics, but it is unclear to what degree they grasp human knowledge, or how to ensure they do so. In this paper we incorporate knowledge-awareness in language model pretraining without changing the transformer architecture, inserting explicit knowledge layers, or adding external storage of semantic information. Rather, we simply signal the existence of entities to the input of the transformer in pretraining, with an entity-extended tokenizer, and at the output, with an additional entity prediction task. Our experiments show that solely by adding these entity signals in pretraining, significantly more knowledge is packed into the transformer parameters: we observe improved language modeling accuracy, factual correctness in LAMA knowledge probing tasks, and semantics in the hidden representations through edge probing. We also show that our knowledge-aware language model (KALM) can serve as a drop-in replacement for GPT-2 models, significantly improving downstream tasks like zero-shot question answering with no task-related training.
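To make the two entity signals concrete, the sketch below pairs an input-side entity embedding (standing in for the entity-extended tokenizer) with an output-side entity prediction head trained jointly with the usual next-word objective. This is a minimal, hypothetical PyTorch sketch, not the paper's implementation; the class and head names (`EntityAwareLM`, `entity_head`) and the toy vocabulary sizes are illustrative assumptions.

```python
# Hypothetical sketch of entity signaling in LM pretraining (not the KALM code).
# Entity ids are assumed to come from an entity linker; id 0 means "no entity".
import torch
import torch.nn as nn

class EntityAwareLM(nn.Module):
    def __init__(self, vocab_size=50257, entity_vocab_size=20000,
                 d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Input-side signal: entity ids embedded and added to token embeddings.
        self.ent_emb = nn.Embedding(entity_vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Output-side signals: next-word head plus an auxiliary entity head.
        self.word_head = nn.Linear(d_model, vocab_size)
        self.entity_head = nn.Linear(d_model, entity_vocab_size)

    def forward(self, token_ids, entity_ids):
        h = self.tok_emb(token_ids) + self.ent_emb(entity_ids)
        # Causal mask so the model trains left-to-right, as in GPT-2.
        mask = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        h = self.encoder(h, mask=mask)
        return self.word_head(h), self.entity_head(h)

# Joint pretraining loss: standard next-word loss + next-entity loss.
model = EntityAwareLM()
tokens = torch.randint(0, 50257, (2, 16))    # toy token ids
entities = torch.randint(0, 20000, (2, 16))  # toy linked-entity ids
word_logits, ent_logits = model(tokens[:, :-1], entities[:, :-1])
ce = nn.CrossEntropyLoss()
loss = (ce(word_logits.reshape(-1, 50257), tokens[:, 1:].reshape(-1))
        + ce(ent_logits.reshape(-1, 20000), entities[:, 1:].reshape(-1)))
loss.backward()
```

Note the property the abstract emphasizes: the transformer itself is unchanged. The entity signals enter only through an extra embedding table at the input and an auxiliary loss at the output, so at inference time the model can stand in for a GPT-2-style LM.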