With recent advances in speech synthesis, synthetic data is becoming a viable alternative to real data for training speech recognition models. However, machine learning with synthetic data is not trivial due to the gap between the synthetic and the real data distributions. Synthetic datasets may contain artifacts that do not exist in real data, such as structured noise, content errors, or unrealistic speaking styles. Moreover, the synthesis process may introduce a bias due to uneven sampling of the data manifold. We propose two novel techniques during training to mitigate the problems caused by the distribution gap: (i) a rejection sampling algorithm and (ii) separate batch normalization statistics for the real and the synthetic samples. We show that these methods significantly improve the training of speech recognition models using synthetic data. We evaluate the proposed approach on keyword detection and Automatic Speech Recognition (ASR) tasks, and observe up to 18% and 13% relative error reduction, respectively, compared to naively using the synthetic data.
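To make the two techniques concrete, below is a minimal sketch in PyTorch, not the paper's implementation. It assumes a domain classifier provides a density-ratio estimate for rejection sampling, and that batches are domain-homogeneous so each can be routed to its own batch-norm layer; the names `DualBatchNorm`, `rejection_sample`, and the `is_synthetic` flag are illustrative inventions.

```python
import torch
import torch.nn as nn


class DualBatchNorm(nn.Module):
    """Keeps two sets of batch-norm running statistics: one for real
    samples and one for synthetic samples, so synthesis artifacts do
    not contaminate the real-data statistics (a sketch, not the
    paper's exact module)."""

    def __init__(self, num_features: int):
        super().__init__()
        self.bn_real = nn.BatchNorm1d(num_features)
        self.bn_synth = nn.BatchNorm1d(num_features)

    def forward(self, x: torch.Tensor, is_synthetic: bool) -> torch.Tensor:
        # Route the whole batch through the BN layer matching its domain.
        return self.bn_synth(x) if is_synthetic else self.bn_real(x)


def rejection_sample(samples, density_ratio, m: float):
    """Classic rejection sampling over synthetic samples: keep x with
    probability p_real(x) / (m * p_synth(x)). Here `density_ratio(x)`
    is an assumed estimator of p_real(x)/p_synth(x), e.g. derived from
    a real-vs-synthetic domain classifier, and m upper-bounds the ratio."""
    return [x for x in samples if torch.rand(()).item() < density_ratio(x) / m]


# Usage: feed real and synthetic mini-batches separately through the
# dual-statistics layer (80-dim filterbank features are an assumption).
layer = DualBatchNorm(num_features=80)
real_batch = torch.randn(16, 80)
synth_batch = torch.randn(16, 80)
y_real = layer(real_batch, is_synthetic=False)
y_synth = layer(synth_batch, is_synthetic=True)
```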