Figures of speech, such as metaphor and irony, are ubiquitous in literary works and colloquial conversations. This poses a great challenge for natural language understanding, since figures of speech usually deviate from their ostensible meanings to express deeper semantic implications. Previous research emphasizes the literary aspect of figures and seldom provides a comprehensive exploration from the perspective of computational linguistics. In this paper, we first propose the concept of a figurative unit, which is the carrier of a figure. We then select 12 types of figures commonly used in Chinese and build a Chinese corpus for Contextualized Figure Recognition (ConFiguRe). Unlike previous token-level or sentence-level counterparts, ConFiguRe aims at extracting a figurative unit from discourse-level context and classifying it into the correct figure type. On ConFiguRe, we design three tasks, i.e., figure extraction, figure type classification, and figure recognition, and use state-of-the-art techniques to implement the benchmarks. We conduct thorough experiments and show that all three tasks are challenging for existing models and thus require further research. Our dataset and code are publicly available at https://github.com/pku-tangent/ConFiguRe.
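To make the relation among the three tasks concrete, the following is a minimal sketch, not the implementation released with the paper. It assumes that figure recognition can be viewed as figure extraction followed by figure type classification. The names `FigurativeUnit`, `recognize`, `toy_extract`, and `toy_classify`, the labels `"simile"` and `"other"`, and the English example text are all hypothetical and chosen only for readability; the actual corpus is Chinese and uses its own 12 figure types.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


@dataclass(frozen=True)
class FigurativeUnit:
    """Hypothetical container for one extracted unit and its predicted type."""
    text: str
    span: Span
    figure_type: str


def recognize(context: str,
              extract: Callable[[str], List[Span]],
              classify: Callable[[str], str]) -> List[FigurativeUnit]:
    """Figure recognition, assumed here to be extraction plus classification."""
    units = []
    for start, end in extract(context):
        text = context[start:end]
        units.append(FigurativeUnit(text, (start, end), classify(text)))
    return units


def toy_extract(context: str) -> List[Span]:
    """Toy stand-in for figure extraction: keep sentences containing ' like '."""
    spans = []
    for match in re.finditer(r"[^.]+\.", context):
        sentence = match.group()
        # Skip the whitespace that follows the previous sentence.
        start = match.start() + len(sentence) - len(sentence.lstrip())
        if " like " in sentence:
            spans.append((start, match.end()))
    return spans


def toy_classify(unit_text: str) -> str:
    """Toy stand-in for figure type classification (hypothetical labels)."""
    return "simile" if " like " in unit_text else "other"


context = "The night was quiet. Her smile was like sunshine. We went home."
units = recognize(context, toy_extract, toy_classify)
for unit in units:
    print(unit.figure_type, unit.span, unit.text)
```

In this sketch the extractor and classifier are passed in as plain callables, so either stage could be replaced by a trained model without changing `recognize`.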