End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model. It promises to improve the performance of assistant systems by leveraging acoustic information that is lost in the intermediate textual representation and by preventing cascading errors from Automatic Speech Recognition (ASR). Further, having one unified model has efficiency advantages when deploying assistant systems on-device. However, the limited number of public audio datasets with semantic parse labels hinders research progress in this area. In this paper, we release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex publicly available SLU dataset. Additionally, we define low-resource splits to establish a benchmark for improving SLU when limited labeled data is available. Furthermore, in addition to the human-recorded audio, we release a TTS-generated version to benchmark low-resource domain adaptation of end-to-end SLU systems. Initial experiments show end-to-end SLU models performing slightly worse than their cascaded counterparts, which we hope encourages future work in this direction.
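To make the contrast concrete, below is a minimal sketch of the two designs the abstract compares. This is an illustration under stated assumptions, not the paper's implementation: the `asr`, `nlu`, and `model` objects and their method names are hypothetical placeholders, and the bracketed parse string merely illustrates the task-oriented semantic parse format.

```python
# Hypothetical interfaces for the two SLU designs; not APIs from the paper.
from typing import Protocol


class ASRModel(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class SemanticParser(Protocol):
    def parse(self, transcript: str) -> str: ...

class EndToEndSLU(Protocol):
    def parse_audio(self, audio: bytes) -> str: ...


def cascaded_slu(audio: bytes, asr: ASRModel, nlu: SemanticParser) -> str:
    """Cascaded pipeline: audio -> ASR transcript -> semantic parse.
    Prosody and other acoustic cues are lost at the text bottleneck,
    and any ASR error propagates into the downstream parser."""
    transcript = asr.transcribe(audio)  # e.g. "set an alarm for 7 am"
    return nlu.parse(transcript)        # e.g. "[IN:CREATE_ALARM [SL:DATE_TIME for 7 am ] ]"


def end_to_end_slu(audio: bytes, model: EndToEndSLU) -> str:
    """End-to-end pipeline: a single model maps audio directly to the
    semantic parse, with no intermediate transcript to lose information
    or inject recognition errors."""
    return model.parse_audio(audio)
```

The single-model path is also the source of the on-device efficiency advantage noted above: one network to store and run instead of two.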