We present NarraBench, a theory-informed taxonomy of narrative-understanding tasks, together with an associated survey of 78 existing benchmarks in the area. We find a significant need for new evaluations covering aspects of narrative understanding that are either overlooked in current work or poorly aligned with existing metrics. Specifically, we estimate that only 27% of narrative tasks are well captured by existing benchmarks, and we note that some areas, including narrative events, style, perspective, and revelation, are nearly absent from current evaluations. We also note the need for increased development of benchmarks capable of assessing constitutively subjective and perspectival aspects of narrative, that is, aspects for which there is generally no single correct answer. Our taxonomy, survey, and methodology are of value to NLP researchers seeking to test LLM narrative understanding.