In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems for the SLU intent detection task: 1) text-based, 2) lattice-based, and 3) a novel multimodal approach. Our work provides a comprehensive analysis of the achievable performance of different state-of-the-art SLU systems under varying conditions, e.g., automatically vs. manually generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) output allows SLU systems to improve over the 1-best setup (4% relative improvement). Moreover, crossmodal approaches, i.e., learning from acoustic and textual embeddings, obtain performance similar to the oracle setup, with an 18% relative improvement over the 1-best configuration. Thus, crossmodal architectures represent a good alternative for overcoming the limitations of working with purely automatically generated textual data.
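To make the crossmodal idea concrete, the sketch below illustrates late fusion of utterance-level acoustic and text embeddings followed by a linear intent scorer. This is only a minimal toy illustration, not the paper's architecture: the embedding dimensions, weight initialization, and number of intent classes are all hypothetical placeholders, and a real system would learn the classifier from data.

```python
import random

random.seed(0)

# Hypothetical toy sizes; the paper does not specify these values.
AUDIO_DIM, TEXT_DIM, NUM_INTENTS = 4, 6, 3

def fuse(audio_emb, text_emb):
    # Late fusion: concatenate the two utterance-level embeddings
    # into a single crossmodal representation.
    return audio_emb + text_emb

def classify(fused, weights, bias):
    # Linear scorer per intent class; predict the argmax intent.
    scores = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return scores.index(max(scores))

# Stand-ins for embeddings produced by acoustic and text encoders.
audio_emb = [random.gauss(0, 1) for _ in range(AUDIO_DIM)]
text_emb = [random.gauss(0, 1) for _ in range(TEXT_DIM)]

# Randomly initialized classifier parameters (untrained, for illustration).
weights = [[random.gauss(0, 1) for _ in range(AUDIO_DIM + TEXT_DIM)]
           for _ in range(NUM_INTENTS)]
bias = [0.0] * NUM_INTENTS

fused = fuse(audio_emb, text_emb)
pred = classify(fused, weights, bias)
```

The design choice shown here, concatenation before a shared classifier, is one common way to let the model fall back on acoustic evidence when the ASR transcript is unreliable.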