We propose phoneme audio mix up (PAMP), a novel text-to-speech (TTS) data augmentation framework for low-resource automatic speech recognition (ASR) tasks. PAMP is highly interpretable and can incorporate prior knowledge of pronunciation rules. Furthermore, it can be deployed easily in almost any language, which makes it especially suitable for low-resource ASR tasks. Extensive experiments demonstrate the effectiveness of PAMP on low-resource ASR tasks: we achieve a \textbf{10.84\%} character error rate (CER) on the Common Voice Cantonese ASR task, a relative improvement of about \textbf{30\%} over the previous state of the art, which was obtained by fine-tuning the pretrained wav2vec2 model.