Readability assessment is the process of identifying how easy or difficult a piece of text is for its intended audience. Approaches have evolved from arithmetic formulas to more complex pattern-recognizing models trained with machine learning algorithms. While these approaches provide competitive results, limited work has been done on quantitatively analyzing how linguistic variables affect model inference. In this work, we dissect machine learning-based readability assessment models in Filipino by performing global and local model interpretation to understand the contributions of varying linguistic features, and we discuss their implications in the context of the Filipino language. Results show that a model trained with the top features from global interpretation achieves higher performance than models using features selected by Spearman correlation. We also empirically observe local feature weight boundaries that discriminate reading difficulty at an extremely fine-grained level, along with the corresponding effects when feature values are perturbed.
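To make the contrast between the two feature-selection strategies concrete, the following is a minimal sketch and not the procedure used in this work. It scores features once by Spearman correlation with the reading level and once by permutation importance, which is used here only as a stand-in for a global interpretation method. The feature names (`word_count`, `noise`), the toy data, and the threshold-based `predict` function are all hypothetical.

```python
import random
from statistics import mean


def ranks(values):
    """Return 1-based ranks, averaging the rank over tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_importance(predict, rows, labels, n_features, repeats=20, seed=0):
    """Mean drop in accuracy when a single feature column is shuffled."""
    rng = random.Random(seed)

    def accuracy(data):
        return mean(1.0 if predict(r) == y else 0.0 for r, y in zip(data, labels))

    baseline = accuracy(rows)
    importances = []
    for f in range(n_features):
        drops = []
        for _ in range(repeats):
            column = [r[f] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:f] + [v] + r[f + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(shuffled))
        importances.append(mean(drops))
    return importances


# Hypothetical toy data: each row is [word_count, noise]; labels are grade levels.
rows = [[20, 3], [35, 1], [60, 2], [80, 3], [120, 1], [150, 2]]
labels = [1, 1, 2, 2, 3, 3]


def predict(row):
    """Stand-in for a trained model: thresholds on word_count only."""
    if row[0] < 50:
        return 1
    if row[0] < 100:
        return 2
    return 3


rho_word_count = spearman([r[0] for r in rows], labels)
rho_noise = spearman([r[1] for r in rows], labels)
importances = permutation_importance(predict, rows, labels, n_features=2)

print("Spearman:", rho_word_count, rho_noise)
print("Permutation importance:", importances)
```

The two scores answer different questions: Spearman correlation measures how monotonically a feature tracks the label regardless of any model, whereas permutation importance measures how much a particular trained model depends on that feature. In this toy setup the stand-in model ignores `noise`, so shuffling that column cannot change its predictions.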