Non-autoregressive Transformers (NATs) significantly reduce decoding latency by generating all tokens in parallel. However, such independent predictions prevent NATs from capturing the dependencies between tokens that are needed to model multiple possible translations. In this paper, we propose the Directed Acyclic Transformer (DA-Transformer), which organizes its hidden states in a Directed Acyclic Graph (DAG), where each path of the DAG corresponds to a specific translation. The whole DAG simultaneously captures multiple translations and facilitates fast prediction in a non-autoregressive fashion. Experiments on the raw training data of the WMT benchmark show that DA-Transformer substantially outperforms previous NATs by about 3 BLEU on average, making it the first NAT model to achieve results competitive with autoregressive Transformers without relying on knowledge distillation.
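To make the path view concrete, below is a minimal sketch of the dynamic program such a DAG implies: the probability of a target sentence is the sum over all DAG paths that emit it, computable in log space without enumerating paths. This is an illustration based on the abstract alone, not the authors' released implementation; the function name `dag_log_prob`, the tensor shapes, and the start/end-vertex convention are assumptions.

```python
import torch

def dag_log_prob(emit_logp, trans_logp, target):
    """Sum over all DAG paths that emit `target`, via dynamic programming.

    emit_logp:  (L, V) log-probability of emitting each vocabulary token
                at each of the L decoder vertices.
    trans_logp: (L, L) log transition probabilities; trans_logp[i, j] is
                the log-probability of stepping from vertex i to vertex j,
                and is -inf unless j > i (the graph is acyclic).
    target:     (m,) target token ids; paths are assumed to start at
                vertex 0 and end at vertex L - 1.
    """
    L, _ = emit_logp.shape
    m = target.shape[0]
    # f[i] = log prob of emitting the prefix target[:t] with the path
    # currently standing on vertex i.
    f = torch.full((L,), float("-inf"))
    f[0] = emit_logp[0, target[0]]  # every path starts at vertex 0
    for t in range(1, m):
        # take one DAG edge, then emit the next target token
        f = torch.logsumexp(f.unsqueeze(1) + trans_logp, dim=0)
        f = f + emit_logp[:, target[t]]
    return f[L - 1]  # every path must end at the last vertex

# Toy usage with assumed sizes: 8 vertices, vocabulary of 10, target length 4.
L, V = 8, 10
emit_logp = torch.log_softmax(torch.randn(L, V), dim=-1)
# Transitions only go forward: mask the diagonal and below to -inf,
# then normalize each row over its successors.
fwd = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
scores = torch.randn(L, L).masked_fill(~fwd, float("-inf"))
trans_logp = torch.log_softmax(scores, dim=-1)
trans_logp[L - 1] = float("-inf")  # last vertex has no outgoing edges
target = torch.randint(0, V, (4,))
print(dag_log_prob(emit_logp, trans_logp, target))
```

The recursion runs in O(mL^2) time for a target of length m and a DAG of L vertices, so training can marginalize over exponentially many paths exactly; at inference, non-autoregressive decoding would instead follow high-probability edges and emissions through the graph.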