Probing complex language models has recently revealed several insights into linguistic and semantic patterns found in the learned representations. In this article, we probe BERT specifically to understand and measure the relational knowledge it captures in its parametric memory. While probing for linguistic understanding is commonly applied to all layers of BERT as well as to fine-tuned models, this has not been done for factual knowledge. We utilize existing knowledge base completion tasks (LAMA) to probe every layer of pre-trained as well as fine-tuned BERT models (ranking, question answering, NER). Our findings show that knowledge is not contained solely in BERT's final layers: intermediate layers contribute a significant share (17-60%) of the total knowledge found. Probing intermediate layers also reveals that different types of knowledge emerge at varying rates. When BERT is fine-tuned, relational knowledge is forgotten, and the extent of forgetting depends on the fine-tuning objective and the training data. We found that ranking models forget the least and retain more knowledge in their final layer than masked language modeling and question answering. However, masked language modeling performed best at acquiring new knowledge from the training data. When it comes to learning facts, we found that capacity and fact density are key factors. We hope this initial work will spur further research into understanding the parametric memory of language models and the effect of training objectives on factual knowledge. The code to repeat the experiments is publicly available on GitHub.
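As a rough illustration of layer-wise knowledge probing, the sketch below feeds a LAMA-style cloze query through a pre-trained BERT and scores the gold object token at every layer. It is a minimal sketch under stated assumptions, not the released experiment code: the model name, the example fact, and the reuse of the final masked-language-modeling head on intermediate hidden states are illustrative choices.

```python
# Minimal sketch of layer-wise factual probing (assumptions: bert-base-uncased,
# a hypothetical LAMA-style cloze fact, and reusing the final MLM head on every layer).
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

query = "Paris is the capital of [MASK]."   # hypothetical LAMA-style cloze statement
gold = "france"                             # expected object token

inputs = tokenizer(query, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
gold_id = tokenizer.convert_tokens_to_ids(gold)

with torch.no_grad():
    outputs = model(**inputs)
    # hidden_states is a tuple: (embeddings, layer 1, ..., layer 12) for bert-base
    for layer, hidden in enumerate(outputs.hidden_states[1:], start=1):
        logits = model.cls(hidden)  # apply the pre-trained MLM head to this layer's states
        # rank of the gold token among the vocabulary at the [MASK] position
        rank = (logits[0, mask_pos] > logits[0, mask_pos, gold_id]).sum().item() + 1
        print(f"layer {layer:2d}: gold token rank = {rank}")
```

Running such a probe over a full set of LAMA relations, and repeating it for fine-tuned checkpoints, is one way to quantify how much knowledge each layer holds and how much is forgotten after fine-tuning.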