Spoken Language Understanding (SLU) is a core task in most human-machine interaction systems. With the emergence of smart homes, smartphones, and smart speakers, SLU has become a key technology for the industry. In a classical SLU approach, an Automatic Speech Recognition (ASR) module transcribes the speech signal into a textual representation from which a Natural Language Understanding (NLU) module extracts semantic information. Recently, End-to-End SLU (E2E SLU) based on Deep Neural Networks has gained momentum because it benefits from the joint optimization of the ASR and NLU parts, thereby limiting the error cascading effect of the pipeline architecture. However, little is known about the actual linguistic properties used by E2E models to predict concepts and intents from speech input. In this paper, we present a study identifying the signal features and other linguistic properties used by an E2E model to perform the SLU task. The study is carried out in the application domain of a smart home that has to handle non-English (here, French) voice commands. The results show that good E2E SLU performance does not always require perfect ASR capability. Furthermore, the results show that the E2E model handles background noise and syntactic variation better than the pipeline model. Finally, a finer-grained analysis suggests that the E2E model uses the pitch information of the input signal to identify voice command concepts. The results and methodology outlined in this paper provide a springboard for further analyses of E2E models in speech processing.