We propose an end-to-end ASR system that can be trained on transcribed speech data, text-only data, or a mixture of both. For text-only training, the extended ASR model uses an integrated auxiliary TTS block that generates mel spectrograms from the text. This block consists of a conventional non-autoregressive text-to-mel-spectrogram generator augmented with a GAN enhancer that improves spectrogram quality. The proposed system can improve the accuracy of an ASR model on a new domain using text-only data, and it can significantly outperform conventional audio-text training when large text corpora are used.
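The routing described above, where paired samples reach the ASR model through the usual audio front end and text-only samples pass through the TTS generator and the GAN enhancer, can be illustrated with a minimal sketch. This is not the authors' implementation: `Sample`, `asr_input_features`, and the three `toy_*` stand-ins are hypothetical names introduced here, and the toy functions only mimic the shape of the data flow, not real signal processing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Mel = List[List[float]]  # frames x mel bins (toy representation)


@dataclass
class Sample:
    """One training example; `audio` is None for text-only data (assumed layout)."""
    text: str
    audio: Optional[List[float]] = None


def asr_input_features(
    sample: Sample,
    audio_to_mel: Callable[[List[float]], Mel],
    text_to_mel: Callable[[str], Mel],
    enhance: Callable[[Mel], Mel],
) -> Mel:
    """Return the mel spectrogram that would be fed to the ASR model."""
    if sample.audio is not None:
        # Paired audio-text sample: use features computed from real speech.
        return audio_to_mel(sample.audio)
    # Text-only sample: synthesize a spectrogram, then refine it with the enhancer.
    return enhance(text_to_mel(sample.text))


# Toy stand-ins (hypothetical) so the sketch runs without any ML library.
def toy_audio_to_mel(wave: List[float]) -> Mel:
    return [[abs(x)] for x in wave]  # one frame with one "bin" per waveform value


def toy_text_to_mel(text: str) -> Mel:
    return [[float(ord(c) % 7)] for c in text]  # one frame per character


def toy_enhance(mel: Mel) -> Mel:
    return [[v + 0.5 for v in frame] for frame in mel]  # placeholder refinement


mixed_batch = [Sample("hi", audio=[0.1, -0.2]), Sample("hi")]
features = [
    asr_input_features(s, toy_audio_to_mel, toy_text_to_mel, toy_enhance)
    for s in mixed_batch
]
```

Because both branches return the same kind of feature matrix, a mixed batch can be handled by a single ASR model, which is the property the abstract relies on for training with audio-text data, text-only data, or both.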