Analogy-making between narratives is one of the most critical abilities in natural language understanding. In this paper, we evaluate the ability to identify and generate analogy by building a first-of-its-kind large-scale story-level analogy corpus, StoryAnalogy, which contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory. We design a set of tests on StoryAnalogy, presenting the first evaluation of story-level analogy identification and generation. Interestingly, we find that the analogy identification tasks are extremely challenging not only for the sentence embedding models but also for the recent large language models (LLMs) such as ChatGPT and LLaMa, where ChatGPT only achieved around 30% accuracy in multiple-choice questions (> 85% accuracy for humans). Finally, we find that data in StoryAnalogy can improve LLMs analogy generation quality, where a fine-tuned FlanT5-xxl model yields comparable performance to zero-shot ChatGPT.
翻译:暂无翻译