Much recent effort has been invested in non-autoregressive neural machine translation, which appears to be an efficient alternative to state-of-the-art autoregressive machine translation on modern GPUs. In contrast to the latter, where generation is sequential, the former allows generation to be parallelized across target token positions. Some of the latest non-autoregressive models have achieved impressive translation quality-speed tradeoffs compared to autoregressive baselines. In this work, we reexamine this tradeoff and argue that autoregressive baselines can be substantially sped up without loss in accuracy. Specifically, we study autoregressive models with encoders and decoders of varied depths. Our extensive experiments show that given a sufficiently deep encoder, a single-layer autoregressive decoder can substantially outperform strong non-autoregressive models with comparable inference speed. We show that the speed disadvantage for autoregressive baselines compared to non-autoregressive methods has been overestimated in three aspects: suboptimal layer allocation, insufficient speed measurement, and lack of knowledge distillation. Our results establish a new protocol for future research toward fast, accurate machine translation. Our code is available at https://github.com/jungokasai/deep-shallow.
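To make the layer-allocation idea concrete, here is a minimal sketch using PyTorch's torch.nn.Transformer. It is illustrative only, not the authors' released implementation (see the repository above); the 12-layer encoder depth, model dimension, and other hyperparameters are assumptions chosen for demonstration.

```python
# Illustrative sketch (assumed configuration, not the authors' released code)
# of a deep-encoder, shallow-decoder Transformer: a deep encoder paired with
# a single-layer autoregressive decoder.
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=12,  # deep encoder: runs once per source sentence
    num_decoder_layers=1,   # single-layer decoder: the per-step cost of autoregressive decoding
    dim_feedforward=2048,
)

# PyTorch's default layout is (sequence length, batch size, d_model).
src = torch.rand(24, 16, 512)  # encoded in parallel over source positions
tgt = torch.rand(30, 16, 512)  # generated left-to-right at inference time
causal_mask = model.generate_square_subsequent_mask(tgt.size(0))
out = model(src, tgt, tgt_mask=causal_mask)
print(out.shape)  # torch.Size([30, 16, 512])
```

At inference time the decoder still generates token by token, but under this allocation the sequential per-step cost is confined to a single decoder layer, while the deep encoder is applied only once per source sentence and in parallel across source positions.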