While pre-trained language models (LMs) have brought great improvements to many NLP tasks, increasing attention has been paid to exploring the capabilities of LMs and interpreting their predictions. However, existing works usually focus only on a certain capability through specific downstream tasks, and there is a lack of datasets for directly evaluating the masked word prediction performance and the interpretability of pre-trained LMs. To fill this gap, we propose a novel evaluation benchmark that provides both English and Chinese annotated data. It tests LMs' abilities in multiple dimensions, i.e., grammar, semantics, knowledge, reasoning, and computation. In addition, it provides carefully annotated token-level rationales that satisfy sufficiency and compactness. The benchmark also contains perturbed instances for each original instance, so that rationale consistency under perturbations can be used as the metric for faithfulness, one perspective of interpretability. We conduct experiments on several widely-used pre-trained LMs. The results show that they perform very poorly on the dimensions of knowledge and computation, and their plausibility in all dimensions is far from satisfactory, especially when the rationale is short. In addition, the pre-trained LMs we evaluated are not robust on syntax-aware data. We will release this evaluation benchmark at \url{http://xyz}, and hope it can facilitate the research progress of pre-trained LMs.