Figures of speech such as metaphors, similes, and idioms make language expressive, evoke emotion, and communicate abstract ideas that might otherwise be difficult to visualize. These figurative forms are often conveyed through multiple modalities, such as text and images, and frequently appear in advertising, news, and social media. Understanding multimodal figurative language is an essential component of human communication and plays a significant role in our daily interactions. While humans can intuitively understand multimodal figurative language, it poses a challenging task for machines, requiring cognitive abilities such as cross-domain mapping, abstraction, commonsense reasoning, and deep linguistic and cultural knowledge. In this work, we propose the Image Recognition of Figurative Language dataset to examine vision and language models' understanding of figurative language. We leverage human annotation and an automatic pipeline we created to generate a multimodal dataset, and introduce two novel tasks as a benchmark for multimodal figurative understanding. We experiment with several baseline models and find that all perform substantially worse than humans. We hope our dataset and benchmark will drive the development of models that better understand figurative language.