In the traditional cascading architecture for spoken language understanding (SLU), automatic speech recognition (ASR) errors have been observed to degrade natural language understanding performance. End-to-end (E2E) SLU models have been proposed to map speech input directly to the desired semantic frame with a single model, thereby mitigating ASR error propagation. Recently, pre-training techniques have been explored for these E2E models. In this paper, we propose a novel joint textual-phonetic pre-training approach for learning spoken language representations, aiming to exploit the full potential of phonetic information to improve SLU robustness to ASR errors. We explore phoneme labels as high-level speech features, and design and compare pre-training tasks based on conditional masked language model objectives and inter-sentence relation objectives. We also investigate the efficacy of combining textual and phonetic information during fine-tuning. Experimental results on the spoken language understanding benchmarks Fluent Speech Commands and SNIPS show that the proposed approach significantly outperforms strong baseline models and improves the robustness of spoken language understanding to ASR errors.
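To make the conditional masked language model idea concrete, the sketch below builds a joint textual-phonetic training example: word tokens are randomly masked while the phoneme sequence is left intact, so a model could learn to recover masked words from their pronunciations. This is an illustrative simplification with hypothetical names (`mask_text_conditioned_on_phonemes`, the special tokens, and the masking rate), not the paper's exact pre-training recipe.

```python
import random

MASK = "[MASK]"

def mask_text_conditioned_on_phonemes(words, phonemes, mask_prob=0.15, rng=None):
    """Hypothetical conditional-MLM example builder: mask word tokens only,
    keeping the phoneme segment unmasked as the conditioning signal."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for w in words:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # hide the word token
            labels.append(w)      # prediction target: the original word
        else:
            inputs.append(w)
            labels.append(None)   # not a prediction target
    # Pack both modalities into one sequence: [CLS] text [SEP] phonemes [SEP]
    sequence = ["[CLS]"] + inputs + ["[SEP]"] + phonemes + ["[SEP]"]
    return sequence, labels

# Example: "play music" with (assumed) ARPAbet-style phoneme labels.
words = ["play", "music"]
phonemes = ["P", "L", "EY", "M", "Y", "UW", "Z", "IH", "K"]
seq, labels = mask_text_conditioned_on_phonemes(words, phonemes, mask_prob=1.0)
```

With `mask_prob=1.0` every word is masked, so the only route to predicting "play" and "music" is the unmasked phoneme segment, which is the intuition behind conditioning the MLM objective on phonetic input.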