Methods for scoring text readability have been studied for over a century and are widely used in research and in user-facing applications across many domains. Thus far, the development and evaluation of such methods have relied primarily on two types of offline behavioral data: performance on reading comprehension tests and ratings of text readability levels. In this work, we instead focus on a fundamental and understudied aspect of readability: real-time reading ease, captured with online reading measures using eye tracking. We introduce an evaluation framework for readability scoring methods that quantifies their ability to account for reading ease while controlling for content variation across texts. Applying this evaluation to prominent traditional readability formulas, modern machine learning systems, frontier large language models, and commercial systems used in education suggests that they are all poor predictors of reading ease in English. This outcome holds across native and non-native speakers, reading regimes, and textual units of different lengths. The evaluation further reveals that existing methods are often outperformed by word properties commonly used in psycholinguistics for predicting reading times. Our results highlight a fundamental limitation of existing approaches to readability scoring, the utility of psycholinguistics for readability research, and the need for new, cognitively driven readability scoring approaches that can better account for reading ease.
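To make the shape of such an evaluation concrete, the following is a minimal sketch, not the paper's actual framework: it compares how much per-text reading-time variance is explained (cross-validated R^2) by a readability score versus by psycholinguistic word properties aggregated per text. All data here are synthetic, and the variable names (readability, word_props, reading_time) are hypothetical placeholders.

```python
# Minimal sketch of comparing a readability score against a
# psycholinguistic word-property baseline as predictors of reading times.
# Synthetic data only; this is an illustration, not the paper's method.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_texts = 200

# Hypothetical per-text predictors:
# - readability: a single score from some formula or system
# - word_props: aggregated word properties (e.g., mean word length,
#   mean log frequency, mean surprisal)
readability = rng.normal(size=(n_texts, 1))
word_props = rng.normal(size=(n_texts, 3))

# Hypothetical per-text mean reading times (ms), driven here by the
# word properties plus noise, purely for illustration.
reading_time = 250 + 30 * (word_props @ np.array([1.0, -0.8, 1.2])) \
    + rng.normal(scale=20, size=n_texts)

# Cross-validated variance explained by each predictor set.
r2_readability = cross_val_score(
    LinearRegression(), readability, reading_time, cv=5, scoring="r2"
).mean()
r2_word_props = cross_val_score(
    LinearRegression(), word_props, reading_time, cv=5, scoring="r2"
).mean()

print(f"readability score R^2:        {r2_readability:.3f}")
print(f"word-property baseline R^2:   {r2_word_props:.3f}")
```

Under this setup, a readability scorer that fails to track reading ease will show low cross-validated R^2 relative to the word-property baseline, which is the pattern the abstract reports for existing methods.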