Complex word identification (CWI) is a cornerstone of effective text simplification. CWI is highly dependent on context, and its difficulty is compounded by the scarcity of available datasets, which vary greatly in domain and language. As a result, it is difficult to develop a robust model that generalizes across a wide array of input examples. In this paper, we propose a novel training technique for the CWI task based on domain adaptation to improve the target character and context representations. This technique addresses the problem of working with multiple domains by smoothing the differences between the explored datasets. Moreover, we propose a related auxiliary task, namely text simplification, that can be used to complement lexical complexity prediction. Our model improves the Pearson correlation coefficient by up to 2.42% over vanilla training techniques on CompLex, the Lexical Complexity Prediction 2021 dataset. In a cross-lingual setup relying on the Complex Word Identification 2018 dataset, we obtain an increase of 3% in Pearson scores. In addition, our model yields state-of-the-art results in terms of Mean Absolute Error.
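For readers unfamiliar with the two evaluation metrics named in the abstract, the following is a minimal sketch of how the Pearson correlation coefficient and the Mean Absolute Error can be computed between gold and predicted complexity scores. The function names and the four example scores are hypothetical and chosen only for illustration; they are not taken from the paper or its datasets.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(xs, ys):
    """Average absolute difference between gold and predicted scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical lexical complexity scores in [0, 1]; not real data.
gold = [0.10, 0.25, 0.40, 0.55]
predicted = [0.15, 0.20, 0.45, 0.50]

print("Pearson:", pearson(gold, predicted))
print("MAE:", mean_absolute_error(gold, predicted))
```

Pearson rewards predictions that rank and scale words consistently with the gold scores, while Mean Absolute Error penalizes the raw distance from them, so a model can improve on one without improving on the other.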