Even in highly developed countries, as many as 15-30% of the population can only understand texts written using a basic vocabulary. Their understanding of everyday texts is limited, which prevents them from taking an active role in society and making informed decisions regarding healthcare, legal representation, or democratic choice. Lexical simplification is a natural language processing task that aims to make text understandable to everyone by replacing complex vocabulary and expressions with simpler ones, while preserving the original meaning. It has attracted considerable attention in the last 20 years, and fully automatic lexical simplification systems have been proposed for various languages. The main obstacle to progress in the field is the absence of high-quality datasets for building and evaluating lexical simplification systems. We present a new benchmark dataset for lexical simplification in English, Spanish, and (Brazilian) Portuguese, and provide details about the data selection and annotation procedures. This is the first dataset that offers a direct comparison of lexical simplification systems across three languages. To showcase the usability of the dataset, we adapt two state-of-the-art lexical simplification systems with differing architectures (neural vs. non-neural) to all three languages and evaluate their performance on our new dataset. For a fairer comparison, we use several evaluation measures that capture different aspects of the systems' efficacy, and discuss their strengths and weaknesses. We find that the state-of-the-art neural lexical simplification system outperforms the state-of-the-art non-neural system in all three languages. More importantly, we find that the neural system performs significantly better for English than for Spanish and Portuguese.