Automatic Speech Recognition (ASR) is a key component of services that let users interact with automated systems by voice. Deep learning methods have made it possible to deploy English ASR systems with word error rates (WER) below 5%. However, these methods are only applicable to languages with hundreds or thousands of hours of audio and corresponding transcriptions. To speed up the availability of resources that can improve ASR performance for so-called low-resource languages, methods for creating new resources from existing ones are being investigated. In this paper we describe a data augmentation approach that improves ASR results for low-resource and agglutinative languages. We carry out experiments developing an ASR system for Quechua using the wav2letter++ model. Our approach reduced WER by 8.73% relative to the base model: the resulting ASR model achieved 22.75% WER and was trained on 99 hours of original resources plus 99 hours of synthetic data obtained through a combination of text augmentation and synthetic speech generation.
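The augmentation pipeline described above pairs each transcript variant with synthesized audio, doubling the training set. A minimal sketch of that idea follows; the abstract does not specify the exact text-augmentation operation or TTS engine used, so the adjacent-word swap and the `tts` callable below are illustrative assumptions only.

```python
import random


def augment_text(sentence, rng):
    """Produce a text variant by swapping two adjacent words.
    Illustrative only: the actual augmentation used in the paper
    may differ (e.g. morphological substitutions for an
    agglutinative language such as Quechua)."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def build_synthetic_pairs(transcripts, tts, rng):
    """For each original transcript, emit one (audio, text) pair:
    an augmented transcript and its synthesized speech.
    `tts` is a hypothetical callable mapping text -> audio bytes."""
    pairs = []
    for text in transcripts:
        new_text = augment_text(text, rng)
        pairs.append((tts(new_text), new_text))
    return pairs
```

With 99 hours of original data, applying this once per utterance would yield roughly 99 additional hours of synthetic (audio, transcript) pairs, matching the 1:1 ratio reported in the abstract.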