Methods for scoring text readability have been studied for over a century and are widely used in research and in user-facing applications across many domains. Thus far, the development and evaluation of such methods have relied primarily on two types of offline behavioral data: performance on reading comprehension tests and ratings of text readability levels. In this work, we instead focus on a fundamental and understudied aspect of readability: real-time reading ease, captured with online reading measures using eye tracking. We introduce an evaluation framework for readability scoring methods that quantifies their ability to account for reading ease while controlling for content variation across texts. Applying this evaluation to prominent traditional readability formulas, modern machine learning systems, frontier Large Language Models, and commercial systems used in education suggests that they are all poor predictors of reading ease in English. This outcome holds across native and non-native speakers, reading regimes, and textual units of different lengths. The evaluation further reveals that existing methods are often outperformed by word properties commonly used in psycholinguistics for predicting reading times. Our results highlight a fundamental limitation of existing approaches to readability scoring, the utility of psycholinguistics for readability research, and the need for new, cognitively driven readability scoring approaches that better account for reading ease.
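To make the core comparison concrete, the following is a minimal sketch of the kind of question the evaluation asks: does a text-level readability score explain variance in eye-tracking reading times beyond simple psycholinguistic word properties? All data, variable names, and coefficients below are synthetic and hypothetical; this is not the paper's actual framework, and it omits the paper's controls for content variation across texts.

```python
# Hypothetical sketch: compare how well a readability score vs. word-level
# psycholinguistic baselines predict eye-tracking reading times.
import numpy as np

rng = np.random.default_rng(0)
n_texts = 200

# Hypothetical per-text predictors.
readability_score = rng.normal(60, 15, n_texts)   # e.g., a readability formula output
mean_word_len = rng.normal(4.8, 0.5, n_texts)     # psycholinguistic baseline
mean_log_freq = rng.normal(-4.0, 0.6, n_texts)    # mean log word frequency

# Hypothetical target: mean per-word gaze duration (ms) from eye tracking.
gaze_ms = 220 - 8 * mean_log_freq + 12 * mean_word_len + rng.normal(0, 20, n_texts)

def r_squared(X, y):
    """OLS R^2 for predictors X (2D) against target y, with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

print("readability score alone:", r_squared(readability_score[:, None], gaze_ms))
print("word-level baselines:   ",
      r_squared(np.column_stack([mean_word_len, mean_log_freq]), gaze_ms))
```

Under this setup, a readability score is a "good" predictor of reading ease only to the extent that it rivals or exceeds the variance explained by the word-property baselines.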