We propose a new framework to improve automatic speech recognition (ASR) systems in resource-scarce environments using a generative adversarial network (GAN) operating on acoustic input features. The GAN is used to enhance the features of mismatched data prior to decoding, or can optionally be used to fine-tune the acoustic model. We achieve improvements comparable to multi-style training (MTR), but at a lower computational cost. With less than one hour of data, an ASR system trained on good-quality data and evaluated on mismatched audio improves by between 11.5% and 19.7% in relative word error rate (WER). Experiments demonstrate that the framework can be very useful in under-resourced environments where training data and computational resources are limited. The GAN does not require parallel training data, because it utilises a baseline acoustic model to provide an additional loss term that guides the generator to create acoustic features that are better classified by the baseline.
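As an illustrative sketch (the notation below is ours, not taken from the paper), the generator objective can be viewed as a standard adversarial loss augmented with the classification loss of the frozen baseline acoustic model on the enhanced features:

\[
\mathcal{L}_G = \mathcal{L}_{\mathrm{adv}}(G) + \lambda \, \mathcal{L}_{\mathrm{AM}}\big(f_{\theta}(G(x)),\, y\big),
\]

where \(x\) is a mismatched input feature vector, \(G(x)\) its enhanced version, \(f_{\theta}\) the fixed baseline acoustic model, \(y\) the corresponding frame labels (assumed here, for illustration, to come from transcripts or forced alignments of the mismatched data), and \(\lambda\) a weighting hyperparameter. Because \(\mathcal{L}_{\mathrm{AM}}\) is computed with the baseline model rather than against clean reference features, no parallel clean/mismatched data is required.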