English proficiency assessments have become a necessary metric for filtering and selecting prospective candidates for both academia and industry. With the rising demand for such assessments, it has become increasingly important to produce automated, human-interpretable results to prevent inconsistencies and ensure meaningful feedback for second language learners. Feature-based classical approaches are more interpretable, as they make it easier to understand what the scoring model learns. Therefore, in this work, we use classical machine learning models to formulate the speech scoring task as both a classification and a regression problem, followed by a thorough study of the relation between linguistic cues and the English proficiency level of the speaker. First, we extract linguistic features in five categories (fluency, pronunciation, content, grammar and vocabulary, and acoustic) and train models to grade responses. Comparing the two formulations, we find that the regression-based models perform on par with or better than the classification approach. Second, we perform ablation studies to understand the impact of each feature and feature category on the performance of proficiency grading. Further, to understand individual feature contributions, we report the importance of the top features for the best-performing algorithm on the grading task. Third, we use Partial Dependence Plots and Shapley values to explore feature importance and conclude that the best-performing trained model learns the underlying rubrics used to grade the dataset used in this study.
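The abstract describes the pipeline only at a high level. As a rough illustration, the sketch below trains a regression-based grader on synthetic linguistic features and then probes it with partial dependence plots and Shapley values. The gradient-boosting model, the scikit-learn and shap libraries, and the placeholder feature names are assumptions made for this example, not the authors' actual setup or data.

# Minimal sketch (not the authors' code): grade responses from linguistic
# features with a regression model, then inspect it with partial dependence
# plots and Shapley values. Feature names and the model choice are
# illustrative assumptions.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder feature matrix: one row per spoken response, one column per
# hypothetical linguistic feature (e.g. speaking rate, pause ratio).
feature_names = ["speaking_rate", "pause_ratio", "grammar_errors", "vocab_diversity"]
X = rng.normal(size=(500, len(feature_names)))
# Synthetic proficiency grades loosely tied to the features.
y = X @ np.array([0.6, -0.4, -0.5, 0.7]) + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Regression formulation of the grading task.
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Partial dependence: marginal effect of each feature on the predicted grade.
PartialDependenceDisplay.from_estimator(
    model, X_test, features=[0, 1, 2, 3], feature_names=feature_names
)

# Shapley values: per-response attribution of the predicted grade to features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=feature_names, show=False)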