Optimizing modern ASR architectures is a high-priority task, since it saves substantial computational resources for both model training and inference. This work proposes a new Uconv-Conformer architecture, based on the standard Conformer model, that progressively reduces the input sequence length by a factor of 16, which speeds up the intermediate layers. To solve the convergence problem caused by such a significant reduction of the time dimension, we use upsampling blocks similar to those in the U-Net architecture, which ensure correct CTC loss calculation and stabilize network training. The Uconv-Conformer architecture is not only faster in training and inference but also achieves a better WER than the baseline Conformer. Our best Uconv-Conformer model showed a 40.3% reduction in epoch training time, along with 47.8% and 23.5% inference acceleration on CPU and GPU, respectively. The relative WER on LibriSpeech test_clean and test_other decreased by 7.3% and 9.2%, respectively.
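As a rough illustration of the upsampling idea described above (a minimal sketch, not the authors' implementation), the following PyTorch snippet shows a U-Net-style upsampling block: a transposed convolution doubles the time resolution of the downsampled features, and a skip connection saved before the matching downsampling stage is fused back in before the CTC head. The module name `Upsample1d` and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Upsample1d(nn.Module):
    """Doubles the time dimension and fuses a skip connection, U-Net style.

    Hypothetical module for illustration; not the paper's code.
    """
    def __init__(self, dim: int):
        super().__init__()
        # Transposed conv with stride 2 doubles the sequence length.
        self.up = nn.ConvTranspose1d(dim, dim, kernel_size=2, stride=2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        # x:    (batch, time, dim)   coarse features after downsampling
        # skip: (batch, 2*time, dim) features cached before that downsampling
        x = self.up(x.transpose(1, 2)).transpose(1, 2)  # (batch, 2*time, dim)
        x = x[:, : skip.size(1)]                        # guard against odd lengths
        return self.norm(x + skip)                      # additive skip fusion

# Usage: upsample 16x-downsampled features back toward the frame rate
# needed for a well-conditioned CTC alignment (sizes are made up).
dim, batch = 256, 4
coarse = torch.randn(batch, 25, dim)   # after 16x time reduction
skip = torch.randn(batch, 50, dim)     # cached at the 8x stage
out = Upsample1d(dim)(coarse, skip)
print(out.shape)  # torch.Size([4, 50, 256])
```

Restoring the time resolution this way matters for CTC: the loss requires the output sequence to be at least as long as the label sequence, so aggressive 16x downsampling alone can make alignments infeasible or training unstable.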