In this study, we propose a novel multi-modal end-to-end neural approach for automated assessment of non-native English speakers' spontaneous speech using attention fusion. The pipeline employs Bi-directional Recurrent Convolutional Neural Networks and Bi-directional Long Short-Term Memory Networks to encode acoustic and lexical cues from spectrograms and transcriptions, respectively. Attention fusion is applied to these learned predictive features to capture complex interactions between the modalities before final scoring. We compare our model against strong baselines and find that attending jointly to lexical and acoustic cues significantly improves the system's overall performance. Further, we present a qualitative and quantitative analysis of our model.
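The modality-level attention fusion described above can be sketched in a few lines. This is a minimal toy illustration, not the authors' actual architecture: the `attention_fusion` helper, the fixed score weights, and the 3-dimensional encodings are all invented for exposition. It scores each modality's encoded feature vector, turns the scores into attention weights via a softmax, and forms a weighted combination that a downstream scorer would consume.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_fusion(acoustic, lexical, w_a, w_l):
    # Score each modality's feature vector with a (toy, fixed) weight
    # vector, softmax the two scores into attention weights, then take
    # a weighted sum of the acoustic and lexical encodings.
    score_a = sum(a * w for a, w in zip(acoustic, w_a))
    score_l = sum(l * w for l, w in zip(lexical, w_l))
    alpha = softmax([score_a, score_l])
    fused = [alpha[0] * a + alpha[1] * l
             for a, l in zip(acoustic, lexical)]
    return fused, alpha

# Toy 3-dim encodings standing in for the acoustic (BRCNN) and
# lexical (BLSTM) encoder outputs.
acoustic = [0.2, 0.5, 0.1]
lexical = [0.7, 0.1, 0.3]
fused, alpha = attention_fusion(acoustic, lexical,
                                w_a=[1.0, 0.5, 0.2],
                                w_l=[0.3, 0.8, 0.1])
print(alpha)  # attention weights over the two modalities; they sum to 1
```

In the paper's setting the fused representation would feed a final regression or classification layer that produces the proficiency score.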