The increasing demand for learning English as a second language has led to a growing interest in methods for automatically assessing spoken language proficiency. Most approaches use hand-crafted features, but their efficacy depends on the assumptions underlying those features, and they risk discarding potentially salient information about proficiency. Other approaches rely on transcriptions produced by ASR systems, which may not provide a faithful rendition of a learner's utterance in specific scenarios (e.g., non-native children's spontaneous speech). Furthermore, transcriptions do not yield any information about relevant aspects such as intonation, rhythm or prosody. In this paper, we investigate the use of wav2vec 2.0 for assessing overall and individual aspects of proficiency on two small datasets, one of which is publicly available. We find that this approach significantly outperforms the BERT-based baseline system, trained on ASR and manual transcriptions, that we use for comparison.