Large-scale, pre-trained language models (LMs) have achieved human-level performance on a breadth of language understanding tasks. However, evaluations based only on end task performance shed little light on machines' true ability in language understanding and reasoning. In this paper, we highlight the importance of evaluating the underlying reasoning process in addition to end performance. Toward this goal, we introduce Tiered Reasoning for Intuitive Physics (TRIP), a novel commonsense reasoning dataset with dense annotations that enable multi-tiered evaluation of machines' reasoning process. Our empirical results show that while large LMs can achieve high end performance, they struggle to support their predictions with valid supporting evidence. The TRIP dataset and our baseline results will motivate verifiable evaluation of commonsense reasoning and facilitate future research toward developing better language understanding and reasoning models.