Lack of training data presents a major challenge to scaling spoken language understanding (SLU) to low-resource languages. Although various data augmentation approaches have been proposed to synthesize training data for low-resource target languages, the augmented datasets are often noisy, which degrades the performance of SLU models. In this paper, we focus on mitigating the noise in augmented data. We develop a denoising training approach in which multiple models are trained on data produced by different augmentation methods and provide supervision signals to each other. Experimental results show that our method outperforms the existing state of the art by 3.05 and 4.24 percentage points on two benchmark datasets, respectively. The code will be open-sourced on GitHub.
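To make the mutual-supervision idea concrete, below is a minimal sketch of one training step, assuming a KL-based mutual-distillation setup between two models; the abstract does not specify the exact loss or number of models, so the function name `mutual_denoising_step`, the `kl_weight` parameter, and the toy classifiers are all illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch, NOT the paper's exact method: two intent classifiers, each
# trained on data from a different (noisy) augmentation method, regularize
# each other with a KL term on their predicted label distributions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mutual_denoising_step(model_a, model_b, batch_a, batch_b,
                          opt_a, opt_b, kl_weight=0.5):
    """One training step: each model fits its own augmented batch and is
    softly supervised by the peer model's predictions on that batch."""
    for model, peer, (x, y), opt in [(model_a, model_b, batch_a, opt_a),
                                     (model_b, model_a, batch_b, opt_b)]:
        logits = model(x)
        with torch.no_grad():                 # peer supplies a fixed target
            peer_probs = F.softmax(peer(x), dim=-1)
        ce = F.cross_entropy(logits, y)       # fit the (possibly noisy) labels
        kl = F.kl_div(F.log_softmax(logits, dim=-1), peer_probs,
                      reduction="batchmean")  # agree with the peer model
        loss = ce + kl_weight * kl            # kl_weight is an assumed knob
        opt.zero_grad()
        loss.backward()
        opt.step()

# Illustrative usage with toy linear classifiers and random "augmented" batches.
model_a, model_b = nn.Linear(16, 5), nn.Linear(16, 5)
opt_a = torch.optim.Adam(model_a.parameters(), lr=1e-3)
opt_b = torch.optim.Adam(model_b.parameters(), lr=1e-3)
make_batch = lambda: (torch.randn(8, 16), torch.randint(0, 5, (8,)))
mutual_denoising_step(model_a, model_b, make_batch(), make_batch(), opt_a, opt_b)
```

The intuition behind this kind of design is that different augmentation methods introduce different noise patterns, so a peer model trained on another source is unlikely to agree with label noise specific to one source; the agreement term thus dampens the influence of noisy examples.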