As one of the most popular sequence-to-sequence modeling approaches for speech recognition, the RNN-Transducer has achieved steadily improving performance with increasingly sophisticated neural network models of growing size trained for more and more epochs. While strong computation resources appear to be a prerequisite for training superior models, we attempt to overcome this constraint by carefully designing a more efficient training pipeline. In this work, we propose an efficient 3-stage progressive training pipeline to build high-performing neural transducer models from scratch with very limited computation resources in a reasonably short time. The effectiveness of each stage is experimentally verified on both the Librispeech and Switchboard corpora. The proposed pipeline is able to train transducer models approaching state-of-the-art performance with a single GPU in just 2-3 weeks. Our best conformer transducer achieves 4.1% WER on Librispeech test-other with only 35 epochs of training.