This paper proposes a method to relax the conditional independence assumption of connectionist temporal classification (CTC)-based automatic speech recognition (ASR) models. We train a CTC-based ASR model with auxiliary CTC losses on intermediate layers in addition to the original CTC loss on the last layer. During both training and inference, each prediction generated at an intermediate layer is added to the input of the next layer, so that the last layer's prediction is conditioned on those intermediate predictions. Our method is easy to implement and retains the merits of CTC-based ASR: a simple model architecture and fast decoding speed. We conduct experiments on three different ASR corpora. Our proposed method improves a standard CTC model significantly (e.g., more than 20% relative word error rate reduction on the WSJ corpus) with little computational overhead. Moreover, on the TEDLIUM2 and AISHELL-1 corpora, it achieves performance comparable to a strong autoregressive model with beam search, while decoding at least 30 times faster.
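The conditioning mechanism described above can be sketched as follows. This is a minimal, hypothetical NumPy illustration, not the paper's implementation: the layer count, dimensions, random linear maps (standing in for real encoder blocks), and projection matrices `W_out`/`W_in` are all assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, V = 5, 8, 6  # frames, hidden dim, vocab size (incl. CTC blank)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy "encoder layers": random linear maps standing in for real blocks.
layers = [rng.standard_normal((D, D)) * 0.1 for _ in range(4)]
W_out = rng.standard_normal((D, V)) * 0.1  # CTC output projection
W_in = rng.standard_normal((V, D)) * 0.1   # maps predictions back to D

x = rng.standard_normal((T, D))
intermediate_posteriors = []
for i, W in enumerate(layers):
    x = np.tanh(x @ W)                # encoder layer i
    if i == 1:                        # an intermediate layer
        post = softmax(x @ W_out)     # frame-level posterior
        intermediate_posteriors.append(post)
        x = x + post @ W_in           # add prediction to next layer's input

final_posterior = softmax(x @ W_out)

# Training applies CTC loss to the final posterior and each intermediate
# posterior; inference stays non-autoregressive: greedy CTC decoding
# collapses repeats and removes blanks in a single pass.
def greedy_ctc_decode(posterior, blank=0):
    out, prev = [], None
    for t in posterior.argmax(axis=-1):
        if t != prev and t != blank:
            out.append(int(t))
        prev = t
    return out

print(final_posterior.shape)
```

Because decoding is a single forward pass plus this linear-time collapse, with no beam search or token-by-token generation, the model keeps the fast inference speed that the abstract contrasts with autoregressive decoding.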