Recent publications on automatic speech recognition (ASR) have a strong focus on attention encoder-decoder (AED) architectures, which tend to suffer from overfitting in low-resource scenarios. One way to tackle this issue is to generate synthetic data with a trained text-to-speech (TTS) system when additional text is available. This has been applied successfully in many publications with AED systems, but only to a very limited extent in the context of other ASR architectures. We investigate the effect of varying the pre-processing, the speaker embedding, and the input encoding of the TTS system on the effectiveness of the synthesized data for AED-ASR training. Additionally, we also consider internal language model subtraction for the first time, resulting in up to 38% relative improvement. We compare the AED results to a state-of-the-art hybrid ASR system, a monophone-based system using connectionist temporal classification (CTC), and a monotonic transducer-based system. We show that for the latter systems the addition of synthetic data has no relevant effect, but they still outperform the AED systems on LibriSpeech-100h. We achieve a final word error rate of 3.3%/10.0% with the hybrid system on the clean/noisy test sets, surpassing all previous state-of-the-art systems on LibriSpeech-100h that do not include unlabeled audio data.
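For orientation, internal language model subtraction is commonly formulated as shallow fusion with an additional subtractive term in the beam-search decoding criterion. The following is a minimal sketch of that standard formulation, not the exact setup of this paper; the scales \lambda_{1}, \lambda_{2} and the way p_{\text{ILM}} is estimated are assumptions for illustration:

\[
\hat{y} \;=\; \operatorname*{arg\,max}_{y}\;\Big[ \log p_{\text{AED}}(y \mid x) \;+\; \lambda_{1}\,\log p_{\text{LM}}(y) \;-\; \lambda_{2}\,\log p_{\text{ILM}}(y) \Big]
\]

Here p_{\text{ILM}} approximates the label prior implicitly learned by the AED decoder; subtracting it reduces the bias toward the training transcripts so that the external language model p_{\text{LM}}, trained on the additional text, can contribute more effectively.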