End-to-end (E2E) spoken language understanding (SLU) systems predict utterance semantics directly from speech using a single model. Previous work in this area has focused on targeted tasks in fixed domains, where the output semantic structure is assumed a priori and the input speech is of limited complexity. In this work we present our approach to developing an E2E model for generalized SLU in commercial voice assistants (VAs). We propose a fully differentiable, transformer-based, hierarchical system that can be pretrained at both the ASR and NLU levels. This is then fine-tuned on both transcription and semantic classification losses to handle a diverse set of intent and argument combinations. The resulting SLU system achieves a 43% improvement in accuracy over baselines on a complex internal generalized VA dataset, while still meeting the 99% accuracy benchmark on the popular Fluent Speech Commands dataset. We further evaluate our model on a hard test set, exclusively containing slot arguments unseen in training, and demonstrate a nearly 20% improvement, showing the efficacy of our approach in truly demanding VA scenarios.
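The joint fine-tuning objective described above can be sketched as a weighted sum of a transcription (ASR) loss and a semantic classification (NLU) loss. This is a minimal illustrative sketch: the function names, the averaging scheme, and the single `nlu_weight` hyperparameter are assumptions for exposition, not the paper's exact formulation.

```python
import math

def cross_entropy(logits, target_idx):
    """Softmax cross-entropy for a single example, computed stably in log space."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_idx]

def joint_slu_loss(asr_token_logits, asr_targets, intent_logits, intent_target,
                   nlu_weight=1.0):
    """Hypothetical combined objective for hierarchical E2E SLU fine-tuning:
    a per-token transcription loss plus a weighted utterance-level intent loss."""
    # Mean cross-entropy over the transcript tokens (the ASR level).
    asr_loss = sum(cross_entropy(l, t)
                   for l, t in zip(asr_token_logits, asr_targets)) / len(asr_targets)
    # Single utterance-level semantic classification loss (the NLU level).
    nlu_loss = cross_entropy(intent_logits, intent_target)
    return asr_loss + nlu_weight * nlu_loss
```

Because both terms are differentiable, gradients from the semantic loss flow back through the ASR-level representations, which is what makes the hierarchy trainable end to end.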