We present a comprehensive study on building and adapting RNN transducer (RNN-T) models for spoken language understanding (SLU). These end-to-end (E2E) models are constructed in three practical settings: a case where verbatim transcripts are available, a constrained case where the only available annotations are SLU labels and their values, and a more restrictive case where transcripts are available but not the corresponding audio. We show how RNN-T SLU models can be developed starting from pre-trained automatic speech recognition (ASR) systems, followed by an SLU adaptation step. In settings where real audio data is not available, artificially synthesized speech is used to successfully adapt various SLU models. When evaluated on two SLU data sets, the ATIS corpus and a customer call center data set, the proposed models closely track the performance of other E2E models and achieve state-of-the-art results.