Building compositional explanations requires models to combine two or more facts that, together, describe why the answer to a question is correct. Typically, these "multi-hop" explanations are evaluated relative to one (or a small number of) gold explanations. In this work, we show these evaluations substantially underestimate model performance, both in terms of the relevance of included facts and the completeness of model-generated explanations, because models regularly discover and produce valid explanations that differ from the gold explanations. To address this, we construct a large corpus of 126k domain-expert (science teacher) relevance ratings that augment a corpus of explanations for standardized science exam questions, discovering 80k additional relevant facts not rated as gold. We build three strong models based on different methodologies (generation, ranking, and schemas) and empirically show that, while expert-augmented ratings provide better estimates of explanation quality, both original (gold) and expert-augmented automatic evaluations still substantially underestimate performance by up to 36% when compared with full manual expert judgements, with different models being disproportionately affected. This poses a significant methodological challenge to accurately evaluating explanations produced by compositional reasoning models.