Optimizing modern ASR architectures is a high-priority task, since it saves substantial computational resources during model training and inference. This work proposes a new Uconv-Conformer architecture based on the standard Conformer model. It progressively reduces the input sequence length by a factor of 16, which speeds up the intermediate layers. To address the convergence issues caused by such a strong reduction of the time dimension, we use upsampling blocks, as in the U-Net architecture, to ensure correct CTC loss computation and to stabilize network training. The Uconv-Conformer architecture is not only faster in training and inference but also achieves a better WER than the baseline Conformer. Our best Uconv-Conformer model shows 47.8% and 23.5% inference acceleration on CPU and GPU, respectively, with relative WER reductions of 7.3% and 9.2% on LibriSpeech test_clean and test_other.
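A minimal PyTorch sketch of this scheme, not the authors' implementation: plain Transformer encoder layers stand in for Conformer blocks, and the layer counts, dimensions, and the final 4x output frame rate are illustrative assumptions. It shows the 16x stride-2 downsampling, the intermediate blocks running on the shortened sequence, and U-Net-style upsampling with skip connections that restores enough frames for CTC.

```python
import torch
import torch.nn as nn

class UconvEncoderSketch(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab_size=1000):
        super().__init__()
        # Four stride-2 convolutions give a 16x reduction of the time axis.
        self.down = nn.ModuleList([
            nn.Conv1d(n_mels if i == 0 else d_model, d_model,
                      kernel_size=3, stride=2, padding=1)
            for i in range(4)
        ])
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=1024,
                                           batch_first=True)
        # The intermediate blocks run on the 16x-shorter sequence,
        # which is where the speed-up comes from.
        self.blocks = nn.TransformerEncoder(layer, num_layers=6)
        # Two stride-2 transposed convolutions upsample back to a 4x
        # reduction (an assumed target rate), restoring frames for CTC.
        self.up = nn.ModuleList([
            nn.ConvTranspose1d(d_model, d_model, kernel_size=2, stride=2)
            for _ in range(2)
        ])
        self.ctc_head = nn.Linear(d_model, vocab_size)

    def forward(self, feats):                    # feats: (B, T, n_mels)
        x = feats.transpose(1, 2)                # -> (B, n_mels, T)
        skips = []
        for conv in self.down:
            x = torch.relu(conv(x))
            skips.append(x)                      # features at T/2 .. T/16
        x = self.blocks(x.transpose(1, 2)).transpose(1, 2)
        # U-Net-style skip connections: pair each upsampling step with
        # the matching-resolution downsampling output (T/8, then T/4).
        for up, skip in zip(self.up, [skips[2], skips[1]]):
            x = up(x)
            n = min(x.size(-1), skip.size(-1))   # crop any padding mismatch
            x = x[..., :n] + skip[..., :n]
        return self.ctc_head(x.transpose(1, 2))  # (B, ~T/4, vocab) logits
```

Under these assumptions, the output logits at roughly T/4 frames can be passed through a log-softmax over the vocabulary dimension and fed to torch.nn.CTCLoss; the skip connections give the upsampled frames access to higher-resolution features, which is what stabilizes training after the aggressive 16x reduction.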