Lexical simplification, the process of replacing complex words in a given sentence with simpler alternatives of equivalent meaning, has attracted much attention in many languages. Although the richness of the Chinese vocabulary makes text very difficult to read for children and non-native speakers, there has been no research on the Chinese lexical simplification (CLS) task. To circumvent difficulties in acquiring annotations, we manually create the first benchmark dataset for CLS, which can be used to evaluate lexical simplification systems automatically. To enable a more thorough comparison, we present five different types of methods as baselines for generating substitute candidates for the complex word: a synonym-based approach, a word embedding-based approach, a pretrained language model-based approach, a sememe-based approach, and a hybrid approach. Finally, we design an experimental evaluation of these baselines and discuss their advantages and disadvantages. To the best of our knowledge, this is the first study of the CLS task.
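As an illustration of one of the five baseline families, a word embedding-based generator can be thought of as ranking vocabulary words by their cosine similarity to the complex word. The following is only a minimal sketch under stated assumptions: the toy three-dimensional vectors, the function name `substitute_candidates`, and the `top_k` parameter are hypothetical and are not taken from this work, whose actual implementation is not described here.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration (assumption).
# A real system would load pretrained Chinese word embeddings instead.
TOY_EMBEDDINGS = {
    "迅速": [0.9, 0.1, 0.0],   # "rapid" (treated here as the complex word)
    "快": [0.8, 0.2, 0.1],     # "fast"
    "缓慢": [-0.7, 0.3, 0.1],  # "slow"
    "苹果": [0.0, 0.1, 0.9],   # "apple"
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def substitute_candidates(complex_word, embeddings, top_k=2):
    """Return the top_k words most similar to complex_word (hypothetical helper)."""
    target = embeddings[complex_word]
    scored = [
        (word, cosine(target, vec))
        for word, vec in embeddings.items()
        if word != complex_word  # never propose the complex word itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:top_k]]


if __name__ == "__main__":
    print(substitute_candidates("迅速", TOY_EMBEDDINGS, top_k=1))
```

Similarity alone does not guarantee that a candidate is simpler than the original word or that it fits the sentence, so a full pipeline would typically also filter or rank the candidates, for example by word frequency.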