Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emerging sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed Transformer and conventional recurrent neural networks (RNN) on a total of 15 ASR, one multilingual ASR, one ST, and two TTS benchmarks. Our experiments revealed various training tips and significant performance benefits obtained with Transformer for each task, including the surprising superiority of Transformer over RNN in 13 of the 15 ASR benchmarks. We are preparing to release Kaldi-style reproducible recipes using open-source and publicly available datasets for all of the ASR, ST, and TTS tasks, so that the community can reproduce and build on our results.