The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are RNN-Transducer (RNN-T) and connectionist temporal classification (CTC). Between these two loss types sit the monotonic RNN-T (MonoRNN-T) and the recently proposed CTC-like Transducer (CTC-T). Monotonic transducers have a few advantages. First, RNN-T can suffer from runaway hallucination, where a model keeps emitting non-blank symbols without advancing in time. Second, monotonic transducers consume exactly one model score per time step and are therefore more compatible with traditional FST-based ASR decoders. However, MonoRNN-T has so far been found to be less accurate than RNN-T. It does not have to be that way: by regularizing the training via joint LAS training or parameter initialization from RNN-T, both MonoRNN-T and CTC-T perform as well as or better than RNN-T. This is demonstrated on LibriSpeech and on a large-scale in-house data set.
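The contrast between RNN-T's unbounded per-frame emissions (the source of runaway hallucination) and a monotonic transducer's one-score-per-frame behavior can be illustrated with a toy greedy-decoding sketch. This is a minimal illustration only, assuming a hypothetical `toy_joint` stand-in for a joint network; none of these names come from the paper:

```python
BLANK = 0

def toy_joint(t, u):
    # Hypothetical joint-network stand-in: keeps proposing label 1 while
    # fewer than 2*t symbols have been emitted, otherwise blank.
    return 1 if u < 2 * t else BLANK

def decode_rnnt(num_frames, max_symbols=100):
    # RNN-T greedy decode: time advances only on blank, so one frame may
    # emit several labels. Without the max_symbols cap, a joint network
    # that never votes blank would loop forever (runaway hallucination).
    t, hyp = 0, []
    while t < num_frames and len(hyp) < max_symbols:
        label = toy_joint(t, len(hyp))
        if label == BLANK:
            t += 1          # advance time only on blank
        else:
            hyp.append(label)  # stay on the same frame
    return hyp

def decode_monotonic(num_frames):
    # Monotonic transducer decode: exactly one model score is consumed
    # per time step and time always advances, so len(hyp) <= num_frames
    # by construction -- runaway emission is impossible.
    hyp = []
    for t in range(num_frames):
        label = toy_joint(t, len(hyp))
        if label != BLANK:
            hyp.append(label)
    return hyp
```

With this toy joint, `decode_rnnt(4)` emits more labels than there are frames, while `decode_monotonic(4)` is bounded by the frame count. The same one-score-per-frame property is what makes monotonic transducers line up naturally with FST-based decoders, which expect a fixed number of scores per time step.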