The lack of wide-coverage datasets annotated with everyday metaphorical expressions for languages other than English is striking. As a result, most research on supervised metaphor detection has been published only for English. To address this issue, this work presents the first corpus annotated with naturally occurring metaphors in Spanish that is large enough to develop metaphor detection systems. The dataset, CoMeta, includes texts from various domains, namely news, political discourse, Wikipedia and reviews. To label CoMeta, we apply the MIPVU method, the guidelines most commonly used to systematically annotate metaphor in real data. We use the newly created dataset to provide competitive baselines by fine-tuning several multilingual and monolingual state-of-the-art large language models. Furthermore, by leveraging the existing VUAM English data in addition to CoMeta, we present what are, to the best of our knowledge, the first cross-lingual experiments on supervised metaphor detection. Finally, we perform a detailed error analysis that explores the seemingly high transfer of everyday metaphor across these two languages and datasets.