This paper presents the use of non-autoregressive (NAR) approaches for joint automatic speech recognition (ASR) and spoken language understanding (SLU) tasks. The proposed NAR systems employ a Conformer encoder that applies connectionist temporal classification (CTC) to transcribe the speech utterance into raw ASR hypotheses, which are further refined with a bidirectional encoder representations from Transformers (BERT)-like decoder. At the same time, the same decoder predicts the intent and slot labels of the utterance. Both Mask-CTC and self-conditioned CTC (SC-CTC) approaches are explored in this study. Experiments conducted on the SLURP dataset show that the proposed SC-Mask-CTC NAR system achieves 3.7% and 3.2% absolute gains in SLU metrics while maintaining competitive ASR accuracy, compared to a Conformer-Transformer-based autoregressive (AR) model. Additionally, the NAR systems decode 6x faster than the AR baseline.
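To make the joint architecture concrete, the sketch below illustrates (in PyTorch, not the authors' code) a BERT-like, bidirectional Transformer decoder in the spirit of Mask-CTC: it re-predicts masked tokens of the CTC hypothesis while emitting an utterance-level intent label and per-token slot labels from the same hidden states. All dimensions, layer counts, pooling, and the cross-attention conditioning on the Conformer encoder output are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class JointMaskCTCDecoder(nn.Module):
    """Hypothetical joint NAR decoder: refines masked CTC hypotheses and
    predicts intent and slot labels from the same representations."""

    def __init__(self, vocab_size, num_intents, num_slots,
                 d_model=256, nhead=4, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        # No causal target mask: every position attends to the full hypothesis,
        # making the decoder bidirectional (BERT-like) and non-autoregressive.
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.token_head = nn.Linear(d_model, vocab_size)    # refills <mask> positions
        self.intent_head = nn.Linear(d_model, num_intents)  # utterance-level intent
        self.slot_head = nn.Linear(d_model, num_slots)      # per-token slot tags

    def forward(self, masked_hyp, encoder_out):
        # masked_hyp: (B, L) CTC hypothesis with low-confidence tokens set to <mask>
        # encoder_out: (B, T, d_model) Conformer encoder states, used as cross-attention memory
        h = self.decoder(self.embed(masked_hyp), encoder_out)
        tokens = self.token_head(h)               # refined transcription logits
        intent = self.intent_head(h.mean(dim=1))  # mean-pooled intent logits
        slots = self.slot_head(h)                 # slot logits per token position
        return tokens, intent, slots
```

At inference time, such a decoder would take the greedy CTC output with low-confidence tokens masked, fill in the masks in parallel, and read off the intent and slot predictions in a single (or a few iterative) passes, which is where the speed advantage over token-by-token AR decoding comes from.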