Automatic Speech Scoring (ASS) is the computer-assisted evaluation of a candidate's speaking proficiency in a language. ASS systems face many challenges, such as open grammar, variable pronunciations, and unstructured or semi-structured content. Recent deep learning approaches have shown some promise in this domain. However, most of these approaches focus on extracting features from a single audio response, leaving them without the speaker-specific context required to model such a complex task. We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling. Our technique takes advantage of the fact that oral proficiency tests rate multiple responses per candidate. We extract context vectors from these responses and feed them as additional speaker-specific context to our network when scoring a particular response. We compare our technique with strong baselines and find that such modeling improves the model's average performance by 6.92% (maximum = 12.86%, minimum = 4.51%). We further provide both quantitative and qualitative insights into the importance of this additional context in solving the problem of ASS.
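The core idea, conditioning the score of one response on context vectors pooled from the candidate's other responses, can be illustrated with a minimal sketch. This is not the paper's architecture: the mean-pooling of context vectors, the concatenation-based conditioning, and the linear scoring head are all simplifying assumptions made here for illustration.

```python
import numpy as np

def score_with_speaker_context(response_embs, target_idx, w, b):
    """Score one response, conditioned on the speaker's other responses.

    response_embs : (n_responses, d) array of per-response embeddings
    target_idx    : index of the response being scored
    w, b          : parameters of a hypothetical linear scoring head
    """
    target = response_embs[target_idx]
    # Speaker-specific context: pool the *other* responses of the same
    # candidate (mean pooling is an assumption, chosen for simplicity).
    others = np.delete(response_embs, target_idx, axis=0)
    context = others.mean(axis=0)
    # Condition the scorer on the speaker context via concatenation.
    features = np.concatenate([target, context])
    return float(features @ w + b)

# Example: 4 responses from one candidate, 8-dimensional embeddings.
rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 8))
w = rng.normal(size=16)  # head sees target + context (2 * 8 dims)
score = score_with_speaker_context(embs, 0, w, 0.0)
```

In a full system the embeddings would come from a learned audio encoder and the scoring head would be trained end to end; the sketch only shows how per-speaker context enters the computation alongside the target response.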