Deploying an end-to-end automatic speech recognition (ASR) model on mobile/embedded devices is a challenging task, since device computational power and energy-consumption requirements vary dynamically in practice. To overcome this issue, we present a training and pruning method for ASR based on connectionist temporal classification (CTC) that allows the model depth to be reduced at run time without any extra fine-tuning. To achieve this goal, we adopt two regularization methods, intermediate CTC and stochastic depth, to train a model whose performance does not degrade much after pruning. We present an in-depth analysis of layer behaviors using singular vector canonical correlation analysis (SVCCA), and efficient strategies for finding layers that are safe to prune. Using the proposed method, we show that a Transformer-CTC model can be pruned to various depths on demand, improving the real-time factor from 0.005 to 0.002 on GPU, while each pruned sub-model maintains the accuracy of an individually trained model of the same depth.
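The combination of stochastic depth during training and layer truncation at inference can be illustrated with a minimal sketch. This is not the paper's implementation: the toy residual layers, the drop probability, and all function names are illustrative assumptions, standing in for Transformer encoder blocks trained with intermediate CTC losses.

```python
import random

class ResidualLayer:
    """Toy stand-in for a Transformer encoder block with a residual path."""

    def __init__(self, scale):
        self.scale = scale  # stands in for the layer's learned transform F

    def forward(self, x):
        return x + self.scale * x  # residual connection: x + F(x)

def forward_stochastic_depth(layers, x, drop_prob, rng):
    # Training-time stochastic depth: each residual block is skipped with
    # probability drop_prob, so the model learns to tolerate missing layers.
    for layer in layers:
        if rng.random() < drop_prob:
            continue  # identity path only; the block's transform is dropped
        x = layer.forward(x)
    return x

def forward_pruned(layers, x, keep_depth):
    # Run-time depth pruning: simply execute the first keep_depth layers,
    # with no extra fine-tuning of the remaining weights.
    for layer in layers[:keep_depth]:
        x = layer.forward(x)
    return x

layers = [ResidualLayer(0.1) for _ in range(12)]
rng = random.Random(0)
out_full = forward_pruned(layers, 1.0, 12)  # full 12-layer model
out_half = forward_pruned(layers, 1.0, 6)   # pruned 6-layer sub-model
out_sd = forward_stochastic_depth(layers, 1.0, 0.3, rng)
```

Because every block keeps an identity (residual) path, skipping a block during training and truncating the stack at inference are structurally the same operation, which is why the pruned sub-models need no fine-tuning in this sketch.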