The computational benefits of iterative non-autoregressive transformers decrease as the number of decoding steps increases. As a remedy, we introduce Distill Multiple Steps (DiMS), a simple yet effective distillation technique that reduces the number of steps required to reach a given translation quality. The distilled model enjoys the computational benefits of early iterations while preserving the gains of several iterative steps. DiMS relies on two models, namely a student and a teacher. The student is optimized to predict the output of the teacher after multiple decoding steps, while the teacher follows the student via a slow-moving average. The moving average keeps the teacher's knowledge up to date and enhances the quality of the labels the teacher provides. During inference, only the student is used for translation, so no additional computation is incurred. We verify the effectiveness of DiMS on various models, obtaining improvements of up to 7 BLEU points on distilled and 12 BLEU points on raw WMT datasets for single-step translation. We release our code at https://github.com/layer6ai-labs/DiMS.
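The core training loop described above can be summarized in a short sketch: the teacher runs several iterative decoding steps without gradients to produce targets, the student is trained to reproduce that multi-step output in a single step, and the teacher's weights then track the student through an exponential moving average. The following PyTorch-style code is a minimal illustration under these assumptions; the names `decode_step`, `num_teacher_steps`, and `ema_decay` are hypothetical placeholders and do not reflect the released implementation.

```python
# Illustrative sketch of the DiMS training step (not the official implementation).
import torch
import torch.nn.functional as F

def ema_update(teacher, student, decay=0.999):
    """Teacher follows the student via a slow-moving (exponential) average."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

def dims_step(student, teacher, decode_step, src, init_tokens,
              num_teacher_steps=2, ema_decay=0.999):
    """One DiMS update: the student learns to match, in a single step,
    the teacher's output after several iterative decoding steps.

    `decode_step(model, src, tokens)` is assumed to return per-token logits
    for one refinement iteration of the non-autoregressive decoder.
    """
    # Teacher performs multiple refinement iterations without gradients.
    with torch.no_grad():
        tokens = init_tokens
        for _ in range(num_teacher_steps):
            teacher_logits = decode_step(teacher, src, tokens)
            tokens = teacher_logits.argmax(dim=-1)  # targets for the student

    # Student predicts the teacher's multi-step output in one step.
    student_logits = decode_step(student, src, init_tokens)  # (B, T, V)
    loss = F.cross_entropy(student_logits.transpose(1, 2), tokens)
    loss.backward()

    # Teacher tracks the student via the slow-moving average.
    ema_update(teacher, student, decay=ema_decay)
    return loss
```

An optimizer step on the student's parameters would follow each call; at inference time only the student is kept, so the decoding cost is unchanged.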