The training of spoken language understanding (SLU) models often faces the problem of data scarcity. In this paper, we propose a data augmentation method that uses pretrained language models to boost the variability and accuracy of generated utterances. Furthermore, we investigate two previously overlooked semi-supervised data-scarcity scenarios in SLU and propose solutions for both: i) Rich-in-Ontology, where ontology information with numerous valid dialogue acts is given; and ii) Rich-in-Utterance, where a large number of unlabelled utterances are available. Empirical results show that our method produces synthetic training data that boosts the performance of language understanding models in various scenarios.
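To make the idea concrete, below is a minimal sketch of act-conditioned utterance generation with a pretrained language model (here GPT-2 via Hugging Face `transformers`). This is not the paper's exact pipeline: the `"act =>"` prompt format and the assumed fine-tuning on (dialogue act, utterance) pairs are illustrative assumptions.

```python
# Minimal sketch: sample synthetic utterances conditioned on a serialized
# dialogue act. Assumes the model has been fine-tuned on (act, utterance)
# pairs in the hypothetical "act => utterance" format shown below.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # stand-in for a fine-tuned checkpoint

def generate_utterances(dialogue_act: str, n: int = 5) -> list[str]:
    """Sample n candidate utterances for one serialized dialogue act."""
    prompt = f"{dialogue_act} =>"  # hypothetical act-to-text prompt format
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,           # sampling increases variability of synthetic data
        top_p=0.9,
        max_new_tokens=40,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [
        tokenizer.decode(out[prompt_len:], skip_special_tokens=True)
        for out in outputs
    ]

# Example: generate_utterances("inform(food=italian, area=centre)")
```

The generated candidates would then be paired with their source dialogue acts to form additional labelled training examples for the SLU model, typically after filtering out low-quality samples.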