Autoregressive (AR) and non-autoregressive (NAR) models each have their own advantages in performance and latency, and combining them into a single model may take advantage of both. Current combination frameworks focus more on integrating multiple decoding paradigms into a unified generative model, e.g. the Masked Language Model. However, this generalization can harm performance due to the gap between the training objective and inference. In this paper, we aim to close this gap by preserving the original objectives of AR and NAR under a unified framework. Specifically, we propose the Directional Transformer (Diformer), which jointly models AR and NAR over three generation directions (left-to-right, right-to-left, and straight) with a newly introduced direction variable that controls the prediction of each token so that it has the dependencies specific to that direction. This direction-based unification preserves the original dependency assumptions of AR and NAR, retaining both generalization and performance. Experiments on 4 WMT benchmarks demonstrate that Diformer outperforms current unified-modelling works by more than 1.5 BLEU points for both AR and NAR decoding, and is also competitive with state-of-the-art independent AR and NAR models.
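The abstract does not spell out how the direction variable is realized; a minimal sketch, assuming it is implemented as direction-conditioned attention masks over a shared decoder, is shown below. The function name `direction_mask` and the use of PyTorch are illustrative assumptions, not the paper's implementation.

```python
# Sketch (not the paper's code): a direction variable selecting which
# target positions each token may depend on, per the three directions
# named in the abstract: left-to-right (AR), right-to-left (AR), and
# "straight" (NAR, all positions visible at once).
import torch

def direction_mask(seq_len: int, direction: str) -> torch.Tensor:
    """Boolean attention mask; True marks positions a token may attend to."""
    if direction == "l2r":        # left-to-right AR: token i sees tokens <= i
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    if direction == "r2l":        # right-to-left AR: token i sees tokens >= i
        return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool))
    if direction == "straight":   # NAR: every token sees the full target
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    raise ValueError(f"unknown direction: {direction}")

# The mask plugs into standard scaled dot-product attention, so a single
# set of decoder weights can serve all three decoding paradigms.
mask = direction_mask(5, "l2r")
scores = torch.randn(5, 5).masked_fill(~mask, float("-inf"))
attn = scores.softmax(dim=-1)
```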