Spoken language understanding (SLU) tasks involve mapping from speech audio signals to semantic labels. Given the complexity of such tasks, good performance might be expected to require large labeled datasets, which are difficult to collect for each new task and domain. However, recent advances in self-supervised speech representations have made it feasible to consider learning SLU models with limited labeled data. In this work we focus on low-resource spoken named entity recognition (NER) and address the question: Beyond self-supervised pre-training, how can we use external speech and/or text data that are not annotated for the task? We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline (speech recognition followed by text NER model) approaches. We find that several of these approaches improve performance in resource-constrained settings beyond the benefits from pre-trained representations alone. Compared to prior work, we find improved F1 scores of up to 16%. While the best baseline model is a pipeline approach, the best performance when using external data is ultimately achieved by an end-to-end model. We provide detailed comparisons and analyses, showing for example that end-to-end models are able to focus on the more NER-specific words.