Chinese idioms are a kind of idiomatic expression, most of which consist of four Chinese characters. Because they are non-compositional and carry metaphorical meaning, Chinese idioms are hard for children and non-native speakers to understand. This study proposes a novel task, Chinese Idiom Paraphrasing (CIP), which aims to rephrase sentences containing idioms into non-idiomatic ones while preserving the original sentence's meaning. Since sentences without idioms are easier for Chinese NLP systems to handle, CIP can be used to pre-process Chinese datasets, thereby facilitating and improving the performance of Chinese NLP tasks such as machine translation, Chinese idiom cloze, and Chinese idiom embeddings. In this study, the CIP task is treated as a special paraphrase generation task. To circumvent difficulties in acquiring annotations, we first establish a large-scale CIP dataset through human and machine collaboration, which consists of 115,530 sentence pairs. We then deploy three baselines and two novel CIP approaches to address the task. The results show that the proposed methods outperform the baselines on the established CIP dataset.