Recent publications on automatic speech recognition (ASR) have a strong focus on attention-based encoder-decoder (AED) architectures, which work well for large datasets but tend to overfit in low-resource scenarios. One way to tackle this issue is to generate synthetic data with a trained text-to-speech (TTS) system when additional text is available. This has been applied successfully in many publications with AED systems. We present a novel approach to silence correction in the data pre-processing for TTS systems, which increases robustness when training on corpora targeted at ASR applications. In this work we not only show the successful application of synthetic data to AED systems, but also test the same method on a highly optimized state-of-the-art Hybrid ASR system and on a competitive monophone-based system using connectionist temporal classification (CTC). We show that for the latter systems the addition of synthetic data has only a minor effect, but they still outperform the AED systems by a large margin on LibriSpeech-100h. We achieve a final word error rate of 3.3%/10.0% with the Hybrid system on the clean/noisy test sets, surpassing all previous state-of-the-art systems that do not include unlabeled audio data.
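The abstract names silence correction as a TTS pre-processing step but does not describe the procedure. As a rough illustration only, the sketch below trims leading and trailing silence with a simple energy threshold; the helper name `correct_silence`, the use of librosa, and the `top_db` setting are assumptions for illustration, not the authors' method.

```python
# Hypothetical sketch of silence correction for TTS training data.
# Assumes an energy-based trim via librosa.effects.trim; names and
# thresholds here are illustrative, not taken from the paper.
import librosa


def correct_silence(wav_path: str, top_db: float = 40.0, sr: int = 16000):
    """Load an utterance and trim leading/trailing silence.

    Frames whose energy falls more than `top_db` dB below the signal's
    peak are treated as silence and removed from the start and end.
    """
    audio, sr = librosa.load(wav_path, sr=sr)
    trimmed, _ = librosa.effects.trim(audio, top_db=top_db)
    return trimmed, sr
```

Trimming untranscribed silence in this way keeps the audio better aligned with the text, which is one plausible reason such a step would make TTS training on ASR-oriented corpora more robust.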