Bootstrapping speech recognition on limited data resources has long been an area of active research. The recent transition to all-neural models and end-to-end (E2E) training brought particular challenges, as these models are known to be data hungry, but it also brought opportunities: language-agnostic representations derived from multilingual data, and word-piece output representations shared across languages with a common script and roots. We investigate the effectiveness of different strategies for bootstrapping an RNN-Transducer (RNN-T) based automatic speech recognition (ASR) system in the low-resource regime, while exploiting the abundant resources available in other languages as well as synthetic audio from a text-to-speech (TTS) engine. Our experiments demonstrate that transfer learning from a multilingual model, a post-ASR text-to-text mapping, and synthetic audio deliver additive improvements, allowing us to bootstrap a model for a new language with a fraction of the data that would otherwise be needed. The best system achieved a 46% relative word error rate (WER) reduction compared to the monolingual baseline, of which 25% relative WER improvement is attributable to the post-ASR text-to-text mapping and the TTS synthetic data.