Factual consistency is one of the important dimensions of summary evaluation, especially as summary generation becomes more fluent and coherent. The ESTIME measure, recently proposed specifically for factual consistency, achieves high correlations with human expert scores for both consistency and fluency, but it is in principle restricted to evaluating text-summary pairs that have high dictionary overlap. This is not a problem for current styles of summarization, but it may become an obstacle for future summarization systems, or for evaluating arbitrary claims against the text. In this work we generalize the method and make a variant of the measure applicable to any text-summary pair. Because ESTIME uses points of contextual similarity, it provides insights into the usefulness of information taken from different BERT layers. We observe that useful information exists in almost all of the layers except the few lowest ones. For consistency and fluency, qualities focused on local text details, the most useful layers are close to the top (but not at the top); for coherence and relevance we found a more complicated and interesting picture.