In this work, we empirically confirm that non-autoregressive translation with an iterative refinement mechanism (IR-NAT) suffers from poor acceleration robustness, because it is more sensitive to the decoding batch size and computing device than autoregressive translation (AT). Motivated by this finding, we investigate how to better combine the strengths of the autoregressive and non-autoregressive translation paradigms. To this end, we demonstrate through synthetic experiments that prompting one-shot non-autoregressive translation with a small number of AT predictions is sufficient to match the performance of IR-NAT. Following this idea, we propose a new two-stage translation prototype called hybrid-regressive translation (HRT). Specifically, HRT first generates a discontinuous sequence via autoregression (e.g., making a prediction every k tokens, k>1) and then fills in all previously skipped tokens at once in a non-autoregressive manner. We also propose a bag of techniques to train HRT effectively and efficiently without adding any model parameters. HRT achieves a state-of-the-art BLEU score of 28.49 on the WMT En-De task and is at least 1.5x faster than AT, regardless of batch size and device. A further bonus of HRT is that it inherits the favorable properties of AT under the deep-encoder-shallow-decoder architecture: compared to the vanilla HRT with a 6-layer encoder and 6-layer decoder, the inference speed of HRT with a 12-layer encoder and 1-layer decoder is doubled again on both GPU and CPU without any BLEU loss.
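To make the two-stage decoding procedure concrete, the following is a minimal conceptual sketch of HRT inference, assuming a hypothetical model interface (`encode`, `decode_step`, `decode_parallel` and the `mask_id` placeholder are illustrative names, not the paper's actual API): an autoregressive pass predicts every k-th target token to form a skeleton, and a single non-autoregressive pass fills in the skipped positions.

```python
def hrt_decode(model, src, k=2, max_len=128, bos=1, eos=2, mask_id=3):
    """Conceptual sketch of hybrid-regressive translation (HRT) decoding.

    Stage 1 (skip-AT): autoregressively predict every k-th target token,
    producing a short "skeleton" of the translation.
    Stage 2 (NAT): fill in all skipped positions in one parallel pass.
    Hypothetical interface; parameter names are illustrative only.
    """
    enc = model.encode(src)

    # Stage 1: skip-autoregressive generation of the skeleton.
    # Each step predicts the token k positions ahead of the previous one,
    # so the autoregressive loop runs roughly max_len / k times.
    skeleton = [bos]
    while len(skeleton) * k < max_len:
        next_tok = model.decode_step(enc, skeleton)
        skeleton.append(next_tok)
        if next_tok == eos:
            break

    # Stage 2: interleave mask placeholders for the skipped positions,
    # then predict all of them simultaneously (one non-autoregressive pass).
    draft = []
    for tok in skeleton:
        draft.append(tok)
        draft.extend([mask_id] * (k - 1))
    return model.decode_parallel(enc, draft)
```

Under this sketch, the number of sequential decoding steps drops by roughly a factor of k relative to plain AT, while the final parallel fill-in is conditioned on the AT-generated skeleton rather than on nothing, which is the intuition behind matching IR-NAT quality with a single non-autoregressive pass.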