End-to-end (E2E) modeling is advantageous for automatic speech recognition (ASR), especially for Japanese, since word-based tokenization of Japanese is not trivial and E2E modeling can model character sequences directly. This paper focuses on the latest E2E modeling techniques and investigates their performance on character-based Japanese ASR through comparative experiments. The results are analyzed and discussed to understand the relative advantages of long short-term memory (LSTM) and Conformer models in combination with connectionist temporal classification, transducer, and attention-based loss functions. Furthermore, the paper investigates the effectiveness of recent training techniques such as data augmentation (SpecAugment), variational noise injection, and exponential moving average. The best configuration found in the paper achieved state-of-the-art character error rates of 4.1%, 3.2%, and 3.5% on the Corpus of Spontaneous Japanese (CSJ) eval1, eval2, and eval3 tasks, respectively. The system is also shown to be computationally efficient thanks to the Conformer transducer architecture.
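One of the training techniques mentioned above, SpecAugment, augments training data by masking random frequency bands and time spans of the input spectrogram. The following is a minimal NumPy sketch of that masking idea, not the paper's implementation; the function name and the mask parameters (freq_mask_param, time_mask_param, num_masks) are illustrative assumptions rather than values from the paper.

```python
# A minimal sketch of SpecAugment-style frequency and time masking applied
# to a log-mel spectrogram (time x freq). Parameter values are illustrative
# assumptions, not the configuration used in the paper.
import numpy as np

def spec_augment(spec, freq_mask_param=27, time_mask_param=40, num_masks=2,
                 rng=None):
    """Return a copy of `spec` with random frequency/time masks zeroed out."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_time, n_freq = out.shape

    for _ in range(num_masks):
        # Frequency mask: zero a random band of consecutive mel channels.
        f = int(rng.integers(0, freq_mask_param + 1))
        f0 = int(rng.integers(0, max(n_freq - f, 1)))
        out[:, f0:f0 + f] = 0.0

        # Time mask: zero a random span of consecutive frames.
        t = int(rng.integers(0, time_mask_param + 1))
        t0 = int(rng.integers(0, max(n_time - t, 1)))
        out[t0:t0 + t, :] = 0.0

    return out

# Example: mask an 800-frame, 80-channel log-mel spectrogram.
spectrogram = np.random.randn(800, 80).astype(np.float32)
augmented = spec_augment(spectrogram)
```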