We report two essential improvements in readability assessment: (1) three novel features in advanced semantics and (2) timely evidence that traditional ML models (e.g., Random Forest, using handcrafted features) can be combined with transformers (e.g., RoBERTa) to improve model performance. First, we explore suitable transformers and traditional ML models. Then, we extract 255 handcrafted linguistic features using self-developed extraction software. Finally, we assemble these components into several hybrid models, achieving state-of-the-art (SOTA) accuracy on popular readability assessment datasets. The use of handcrafted features helps model performance on smaller datasets. Notably, our RoBERTa-RF-T1 hybrid achieves a near-perfect classification accuracy of 99%, a 20.3% increase over the previous SOTA.
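To make the hybrid idea concrete, the following is a minimal sketch, not the authors' exact pipeline: it mean-pools a RoBERTa sentence embedding, concatenates it with handcrafted linguistic features, and trains a scikit-learn Random Forest on the combined vector. The `handcrafted_features` function is a hypothetical two-feature placeholder for the 255-feature extraction software described above, and the checkpoint name ("roberta-base"), pooling strategy, and toy labels are assumptions for illustration only.

```python
# Hedged sketch of a transformer + Random Forest hybrid for readability
# classification. Placeholder features and labels; not the paper's pipeline.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import RandomForestClassifier

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def roberta_embedding(text: str) -> np.ndarray:
    """Mean-pool the last hidden states into one fixed-size vector."""
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

def handcrafted_features(text: str) -> np.ndarray:
    """Hypothetical stand-in for the 255 handcrafted linguistic features."""
    tokens = text.split()
    return np.array([len(tokens), np.mean([len(t) for t in tokens])])

def hybrid_vector(text: str) -> np.ndarray:
    """Concatenate transformer embedding with handcrafted features."""
    return np.concatenate([roberta_embedding(text), handcrafted_features(text)])

# Toy example: texts and readability labels would come from a labeled corpus.
texts = ["The cat sat on the mat.", "Quantum entanglement defies classical locality."]
labels = [0, 1]  # e.g., easy vs. hard

X = np.vstack([hybrid_vector(t) for t in texts])
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)
print(clf.predict(X))
```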