End-to-end approaches open the way to more accurate and efficient spoken language understanding (SLU) systems by alleviating the drawbacks of traditional pipeline systems. Previous work exploits textual information for an SLU model through pre-training with automatic speech recognition or fine-tuning with knowledge distillation. To use textual information more effectively, this work proposes a two-stage textual knowledge distillation method that matches the two modalities (speech and text) in sequence: first their utterance-level representations during pre-training, then their predicted logits during fine-tuning. We use vq-wav2vec BERT as the speech encoder because it captures general and rich features. Furthermore, we improve performance, especially in a low-resource scenario, with data augmentation methods that randomly mask spans of discrete audio tokens and contextualized hidden representations. Consequently, we advance the state of the art on the Fluent Speech Commands dataset, achieving 99.7% test accuracy in the full dataset setting and 99.5% in the 10% subset setting. Through ablation studies, we empirically verify that every method used is crucial to the final performance, providing a best practice for spoken language understanding. Code is available at https://github.com/clovaai/textual-kd-slu.
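The two distillation stages and the span-masking augmentation described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (see the linked repository for that). The abstract does not state the loss forms, so the sketch assumes a mean squared error for matching utterance-level representations and a temperature-softened KL divergence for matching logits. All function names, the `temperature` parameter, and the masking parameters `span_len` and `start_prob` are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def representation_loss(speech_vec, text_vec):
    """Stage 1 (pre-training): pull the speech encoder's utterance-level
    vector toward the text encoder's vector. MSE is an assumption here."""
    if len(speech_vec) != len(text_vec):
        raise ValueError("vectors must have the same dimension")
    return sum((s - t) ** 2 for s, t in zip(speech_vec, text_vec)) / len(speech_vec)


def logit_loss(speech_logits, text_logits, temperature=1.0):
    """Stage 2 (fine-tuning): KL(teacher || student) between softened class
    distributions, with the text model as teacher. KL is an assumption here."""
    teacher = softmax(text_logits, temperature)
    student = softmax(speech_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


def mask_spans(items, mask_item, span_len=3, start_prob=0.1, rng=None):
    """Randomly replace contiguous spans with mask_item.

    items may hold discrete audio token IDs or per-frame hidden vectors;
    span_len and start_prob are illustrative values, not the paper's.
    """
    rng = rng or random.Random()
    out = list(items)
    i = 0
    while i < len(out):
        if rng.random() < start_prob:
            # Mask a span starting at i, clipped at the end of the sequence.
            for j in range(i, min(i + span_len, len(out))):
                out[j] = mask_item
            i += span_len
        else:
            i += 1
    return out


if __name__ == "__main__":
    # Toy values only; real inputs would come from the speech and text encoders.
    print(representation_loss([0.2, 0.4], [0.1, 0.5]))
    print(logit_loss([1.0, 0.5, 0.1], [2.0, 0.3, 0.1], temperature=2.0))
    print(mask_spans(list(range(20)), mask_item=-1, rng=random.Random(0)))
```

In a real training loop these quantities would be computed on batched tensors and combined with the task loss. The sketch only shows which pair of outputs each stage compares and how span masking preserves sequence length.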