Analyzing the readability of articles is an important sociolinguistic task. Addressing it is necessary for automatically recommending appropriate articles to readers with different comprehension abilities, and it further benefits education systems, web information systems, and digital libraries. Current methods for assessing readability employ empirical measures or statistical learning techniques that are limited in their ability to characterize complex patterns such as article structure and the semantic meaning of sentences. In this paper, we propose a new and comprehensive framework that uses a hierarchical self-attention model to analyze document readability. In this model, measurements of sentence-level difficulty are captured along with the semantic meaning of each sentence. The sentence-level features are then combined to characterize the overall readability of an article, taking article structure into account. We evaluate the proposed approach on three widely used benchmark datasets against several strong baseline approaches. Experimental results show that our method achieves state-of-the-art performance in estimating the readability of various web articles and literature.
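To make the hierarchical idea concrete, the following is a minimal NumPy sketch of two-level attention: word vectors are attended and pooled into sentence vectors, and sentence vectors are attended and pooled into a document vector that is scored against readability levels. It is an illustration under stated assumptions, not the architecture evaluated in this paper: the function names, the random stand-in embeddings, the projection-free attention, the query vectors, and the three output levels are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Scaled dot-product self-attention over the rows of x, shape (n, d).

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x


def attention_pool(x, query):
    """Collapse (n, d) into (d,) using attention weights from a query vector."""
    weights = softmax(x @ query)
    return weights @ x


def document_vector(doc, word_query, sent_query):
    """doc is a list of (n_words, d) arrays, one array per sentence."""
    # Level 1: words -> one vector per sentence.
    sent_vecs = np.stack(
        [attention_pool(self_attention(s), word_query) for s in doc]
    )
    # Level 2: sentences -> one vector for the whole document.
    return attention_pool(self_attention(sent_vecs), sent_query)


rng = np.random.default_rng(0)
d = 8                                                # hypothetical embedding size
doc = [rng.normal(size=(n, d)) for n in (5, 9, 3)]   # 3 sentences of random "words"
word_query = rng.normal(size=d)                      # stand-ins for learned parameters
sent_query = rng.normal(size=d)
W_out = rng.normal(size=(d, 3))                      # hypothetical: 3 readability levels

vec = document_vector(doc, word_query, sent_query)
probs = softmax(vec @ W_out)                         # untrained, so values are arbitrary
print(vec.shape, probs.shape)
```

In a trained model the query vectors and output weights would be learned, and the attention weights at each level would indicate which words and sentences contribute most to the predicted level.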