In this paper, we propose a novel end-to-end sequence-to-sequence spoken language understanding model using an attention mechanism. It reliably selects contextual acoustic features in order to hypothesize semantic contents. An initial architecture capable of extracting all pronounced words and concepts from acoustic spans is designed and tested. With a shallow-fusion language model, this system reaches a concept error rate (CER) of 13.6 and a concept value error rate (CVER) of 18.5 on the French MEDIA corpus, an absolute reduction of 2.8 points compared to the state of the art. Then, an original model is proposed for hypothesizing concepts and their values. This transduction reaches a CER of 15.4 and a CVER of 21.6 without any new type of context.
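For readers unfamiliar with the metrics, the sketch below shows one common way such error rates are computed, assuming CER and CVER follow the usual edit-distance definition (substitutions, deletions and insertions divided by the number of reference items, expressed as a percentage). The abstract does not spell out the scoring procedure, so this is an illustration under that assumption rather than the authors' evaluation script. The function names and the concept labels in the usage note are hypothetical.

```python
from typing import Hashable, Sequence


def edit_distance(ref: Sequence[Hashable], hyp: Sequence[Hashable]) -> int:
    """Levenshtein distance: minimum substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(refs: Sequence[Sequence[Hashable]],
               hyps: Sequence[Sequence[Hashable]]) -> float:
    """Corpus-level error rate in percent over paired utterances.

    Assumed definition: total edit errors / total reference items * 100.
    """
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must contain the same number of utterances")
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("reference contains no items")
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * errors / total
```

Under this assumption, a CER-style score compares sequences of concept labels, while a CVER-style score compares sequences of (concept, value) pairs, so a correct concept with a wrong value still counts as an error:

```python
# Hypothetical labels and values, for illustration only.
ref_concepts = [["command", "room-type", "number"]]
hyp_concepts = [["command", "number"]]
cer = error_rate(ref_concepts, hyp_concepts)  # one deletion out of three items

ref_pairs = [[("command", "book"), ("number", "2")]]
hyp_pairs = [[("command", "book"), ("number", "3")]]
cver = error_rate(ref_pairs, hyp_pairs)  # one substitution out of two items
```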