The occurrence of unknown words in texts significantly hinders reading comprehension. To improve accessibility for specific target populations, computational modelling has been applied to identify complex words in texts and replace them with simpler alternatives. In this paper, we present an overview of computational approaches to lexical complexity prediction, focusing on work carried out on English data. We survey relevant approaches to this problem, including traditional machine learning classifiers (e.g. SVMs, logistic regression) and deep neural networks, as well as a variety of features, such as word frequency, word length, and others inspired by the psycholinguistics literature. Furthermore, we introduce readers to past competitions and available datasets created on this topic. Finally, we include brief sections on applications of lexical complexity prediction, such as readability and text simplification, together with related studies on languages other than English.
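To make the kind of approach surveyed here concrete, the following is a minimal sketch of complex word identification as binary classification, using the two features the abstract names (word length and word frequency) with a logistic regression classifier trained from scratch. The word list, frequency values, and hyperparameters are hypothetical illustrations, not data from any of the surveyed systems.

```python
import math

# Hypothetical relative corpus frequencies, for illustration only.
FREQ = {"dog": 0.90, "house": 0.80, "run": 0.85, "cat": 0.92,
        "sun": 0.88, "ubiquitous": 0.05, "ephemeral": 0.03,
        "obfuscate": 0.02}

def features(word):
    # Two classic LCP features: normalised word length and corpus frequency.
    return [len(word) / 12.0, FREQ.get(word, 0.0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=2000, lr=0.5):
    # Plain logistic regression fitted with stochastic gradient descent.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for word, label in data:
            x = features(word)
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            err = p - label          # gradient of the log loss w.r.t. the logit
            w[0] -= lr * err * x[0]
            w[1] -= lr * err * x[1]
            b -= lr * err
    return w, b

def is_complex(word, w, b):
    x = features(word)
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b) >= 0.5

# Label 1 = complex, 0 = simple (toy training set).
train_set = [("dog", 0), ("house", 0), ("run", 0), ("cat", 0),
             ("ubiquitous", 1), ("ephemeral", 1), ("obfuscate", 1)]
w, b = train(train_set)
print(is_complex("sun", w, b))        # short, frequent word -> simple
print(is_complex("ephemeral", w, b))  # long, rare word -> complex
```

In a full system, the binary "complex" decision would trigger the substitution step, replacing the flagged word with a simpler alternative; the surveyed work typically uses far richer feature sets (psycholinguistic norms, n-gram frequencies, embeddings) and larger annotated datasets.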