Identifying words that may cause difficulty for a reader is an essential step in most lexical text simplification systems, where it precedes lexical substitution, and it can also be used to assess the readability of a text. This task is commonly referred to as Complex Word Identification (CWI) and is often modelled as a supervised classification problem. Training such systems requires annotated datasets in which words, and sometimes multi-word expressions, are labelled for complexity. In this paper we analyze previous work on this task and investigate the properties of CWI datasets for English. We develop a protocol for the annotation of lexical complexity and use it to annotate a new dataset, CompLex 2.0. We present experiments using both new and old datasets to investigate the nature of lexical complexity. We find that a Likert-scale annotation protocol provides an objective setting that is better suited to identifying the complexity of words than a binary annotation protocol. We release a new dataset, annotated with our new protocol, to promote the task of Lexical Complexity Prediction.
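To make the contrast between the two annotation protocols concrete, the following minimal sketch shows how per-annotator judgements for a single word might be aggregated under each. It is an illustration only, not the procedure used to build CompLex 2.0: the 5-point scale, the linear mapping onto [0, 1], the "any annotator flags it" rule, and the function names `likert_to_score` and `binary_label` are all assumptions introduced here.

```python
from statistics import mean


def likert_to_score(ratings, scale_max=5):
    """Average Likert ratings (1..scale_max) after mapping them linearly to [0, 1].

    Assumption: a 5-point scale and a linear mapping; the abstract does not
    specify either.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > scale_max for r in ratings):
        raise ValueError("rating outside the Likert scale")
    return mean((r - 1) / (scale_max - 1) for r in ratings)


def binary_label(flags):
    """Binary protocol (assumed rule): a word is complex if any annotator flags it."""
    return int(any(flags))


# Hypothetical judgements for one word from three annotators.
print(likert_to_score([1, 3, 5]))          # continuous complexity score
print(binary_label([False, True, False]))  # single 0/1 label
```

Under these assumptions the Likert protocol yields a graded score that preserves disagreement between annotators, whereas the binary protocol collapses the same judgements into a single label.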