The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are the RNN-Transducer (RNN-T) and connectionist temporal classification (CTC) objectives. Both perform alignment-free training by marginalizing over all possible alignments, but they use different transition rules. Between these two loss types sit the monotonic RNN-T (MonoRNN-T) and the recently proposed CTC-like Transducer (CTC-T), both of which can be realized using the graph temporal classification-transducer (GTC-T) loss function. Monotonic transducers have a few advantages. First, RNN-T can suffer from runaway hallucination, where a model keeps emitting non-blank symbols without advancing in time, often in an infinite loop; monotonic transducers avoid this by construction. Second, monotonic transducers consume exactly one model score per time step and are therefore more compatible with, and easier to unify with, traditional FST-based hybrid ASR decoders. However, MonoRNN-T has so far been found to have worse accuracy than RNN-T. It does not have to be that way, though: by regularizing the training, via joint LAS training or parameter initialization from RNN-T, both MonoRNN-T and CTC-T perform as well as, or better than, RNN-T. This is demonstrated for LibriSpeech and for a large-scale in-house data set.