The goal of spoken language understanding (SLU) systems is to determine the meaning of the input speech signal, unlike speech recognition, which aims to produce verbatim transcripts. Advances in end-to-end (E2E) speech modeling have made it possible to train solely on semantic entities, which are far cheaper to collect than verbatim transcripts. We focus on this set prediction problem, where entity order is unspecified. Using two classes of E2E models, RNN transducers and attention-based encoder-decoders, we show that these models work best when the training entity sequence is arranged in spoken order. To improve E2E SLU models when the entity spoken order is unknown, we propose a novel data augmentation technique along with an implicit attention-based alignment method to infer the spoken order. F1 scores significantly increased by more than 11% for RNN-T and about 2% for attention-based encoder-decoder SLU models, outperforming previously reported results.
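The attention-based alignment idea above can be illustrated with a small sketch. Assuming access to a decoder's cross-attention matrix (decoder steps × encoder frames) and the decoder-step spans covering each predicted entity (both hypothetical inputs; the paper's actual model internals are not specified here), each entity can be anchored to the acoustic frame where its attention peaks, and entities can then be sorted by that frame to recover an inferred spoken order:

```python
import numpy as np

def infer_spoken_order(attention, entity_spans):
    """Infer the spoken order of entities from cross-attention weights.

    attention: (num_decoder_steps, num_encoder_frames) array of
        attention weights.
    entity_spans: list of (start, end) decoder-step ranges, one per
        entity, in the (arbitrary) order the model emitted them.
    Returns entity indices sorted by their peak acoustic frame.
    """
    peaks = []
    for start, end in entity_spans:
        # Average the attention rows belonging to this entity, then
        # take the encoder frame where the averaged attention peaks.
        avg = attention[start:end].mean(axis=0)
        peaks.append(int(np.argmax(avg)))
    # Entities attended earlier in the audio come first.
    return sorted(range(len(entity_spans)), key=lambda i: peaks[i])

# Toy example: entity 0 attends late frames, entity 1 attends early ones,
# so the inferred spoken order is [1, 0].
att = np.zeros((4, 10))
att[0:2, 7] = 1.0  # entity 0 (decoder steps 0-1) peaks at frame 7
att[2:4, 2] = 1.0  # entity 1 (decoder steps 2-3) peaks at frame 2
print(infer_spoken_order(att, [(0, 2), (2, 4)]))  # → [1, 0]
```

This is only a conceptual sketch of one plausible alignment criterion (argmax of averaged attention); the paper's implicit alignment may differ in detail.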