End-to-end spoken language understanding (SLU) systems are gaining popularity over cascaded approaches due to their simplicity and ability to avoid error propagation. However, these systems model sequence labeling as a sequence prediction task, diverging from its well-established token-level tagging formulation. We build compositional end-to-end SLU systems that explicitly separate the added complexity of recognizing spoken mentions in SLU from the NLU task of sequence labeling. By relying on intermediate decoders trained for ASR, our end-to-end systems transform the input modality from speech to token-level representations that can be used in the traditional sequence labeling framework. This composition of ASR and NLU formulations in our end-to-end SLU system offers direct compatibility with pre-trained ASR and NLU systems, allows performance monitoring of individual components, and enables the use of globally normalized losses such as a conditional random field (CRF), making these systems attractive in practical scenarios. Our models outperform both cascaded and direct end-to-end models on the sequence labeling task of named entity recognition across SLU benchmarks.