While large language models have shown exciting progress on several NLP benchmarks, evaluating their ability to perform complex analogical reasoning remains under-explored. Here, we introduce a high-quality crowdsourced dataset of narratives for employing proverbs in context as a benchmark for abstract language understanding. The dataset provides fine-grained annotation of aligned spans between proverbs and narratives, and contains minimal lexical overlap between narratives and proverbs, ensuring that models must go beyond surface-level reasoning to succeed. We explore three tasks: (1) proverb recommendation and alignment prediction, (2) narrative generation for a given proverb and topic, and (3) identifying narratives with similar motifs. Our experiments show that neural language models struggle on these tasks compared to humans, and that these tasks pose multiple learning challenges.