Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are far fewer SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark datasets for several tasks. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization, which involve inference over longer speech sequences; named entity localization, which addresses the speech-specific task of locating the targeted content in the signal; and dialog act classification, which identifies the function of a given speech utterance. We follow the blueprint of the Spoken Language Understanding Evaluation (SLUE) benchmark suite. To facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will publish, for each task, (i) annotations for a relatively small fine-tuning set, (ii) annotated development and test sets, and (iii) baseline models for easy reproducibility and comparison. We present the details of data collection and annotation, along with the performance of the baseline models. We also analyze how sensitive the performance of pipeline models (speech recognizer + text model) is to speech recognition accuracy, using more than 20 state-of-the-art speech recognition models.