Spoken language understanding (SLU) has seen significant progress over the last three years, with the emergence of end-to-end neural approaches. Spoken language understanding refers to natural language processing tasks related to semantic extraction from the speech signal, such as named entity recognition from speech or slot filling in the context of human-machine dialogue. Classically, SLU tasks were processed through a cascade approach: an automatic speech recognition process is applied first, followed by a natural language processing module applied to the automatic transcriptions. Over the last three years, end-to-end neural approaches based on deep neural networks have been proposed in order to extract the semantics directly from the speech signal using a single neural model. More recent work on self-supervised training with unlabeled data opens new perspectives in terms of performance for automatic speech recognition and natural language processing. In this paper, we present a brief overview of recent advances on the French MEDIA benchmark dataset for SLU, with and without the use of additional data. We also present our latest results, which significantly outperform the current state of the art with a Concept Error Rate (CER) of 11.2%, compared to 13.6% for the most recent state-of-the-art system presented this year.