Pretrained language models (PLMs) have motivated research into what kinds of knowledge these models learn. Fill-in-the-blank tasks (e.g., cloze tests) are a natural approach for gauging such knowledge. BioLAMA generates prompts for biomedical factual knowledge triples and uses a Top-k accuracy metric to evaluate the knowledge of different PLMs. However, existing research has shown that such prompt-based probing methods can measure only a lower bound of a model's knowledge. Many factors, such as prompt-induced probing biases, make the LAMA benchmark unreliable and unstable, and this problem is even more pronounced in BioLAMA: the severely long-tailed vocabulary distribution and the prevalence of large-N-M (many-to-many) relations keep the performance gap between LAMA and BioLAMA notable. To address these issues, we introduce context variance into prompt generation and propose a new rank-change-based evaluation metric. Departing from the previous known-versus-unknown evaluation criteria, we introduce the concept of "Misunderstand" into LAMA for the first time. Through experiments on 12 PLMs, we show that our context-variance prompts and Understand-Confuse-Misunderstand (UCM) metric make BioLAMA friendlier to large-N-M relations and rare relations. We also conduct a set of control experiments to disentangle "understand" from mere "read and copy".
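To make the probing setup concrete, the following is a minimal sketch of prompt-based knowledge probing with a masked language model, covering the standard Top-k accuracy check and a rank-based check across context-variant prompts. The model name, the example triple, and the `answer_rank` helper are illustrative assumptions, not artifacts of the paper; the paper's actual UCM metric may be defined differently.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative model choice; the paper probes 12 PLMs, many biomedical.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def mask_logits(prompt: str) -> torch.Tensor:
    """Vocabulary logits at the [MASK] position of a cloze prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        return model(**inputs).logits[0, mask_pos]

def top_k_hit(prompt: str, answer: str, k: int = 10) -> bool:
    """Top-k accuracy check: is the gold answer among the model's
    top-k predictions for [MASK]? (Single-token answers only.)"""
    logits = mask_logits(prompt)
    answer_id = tokenizer.convert_tokens_to_ids(answer)
    return answer_id in logits.topk(k).indices.tolist()

def answer_rank(prompt: str, answer: str) -> int:
    """1-based rank of the gold answer among all vocabulary predictions."""
    logits = mask_logits(prompt)
    answer_id = tokenizer.convert_tokens_to_ids(answer)
    return int((logits > logits[answer_id]).sum().item()) + 1

# Hypothetical rank-change comparison over a context-variant prompt:
# if adding context barely moves the gold answer's rank, the model
# plausibly already "knows" the fact rather than copying it from context.
base = answer_rank("Aspirin is used to treat [MASK].", "pain")
varied = answer_rank("As a common analgesic, Aspirin is used to treat [MASK].",
                     "pain")
print(base, varied, abs(base - varied))
```

A rank-change view like this is what allows behavior beyond a binary known/unknown split to be observed, e.g., answers whose rank collapses under benign context changes.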