This paper describes the RoyalFlush neural machine translation system submitted to the WMT 2022 translation efficiency task. Unlike commonly used autoregressive translation systems, we adopt a two-stage translation paradigm called Hybrid Regression Translation (HRT) to combine the advantages of autoregressive and non-autoregressive translation. Specifically, HRT first autoregressively generates a discontinuous sequence (e.g., making a prediction every $k$ tokens, $k>1$) and then fills in all previously skipped tokens at once in a non-autoregressive manner. We can thus easily trade off translation quality against speed by adjusting $k$. In addition, by integrating other modeling techniques (e.g., sequence-level knowledge distillation and a deep-encoder-shallow-decoder layer allocation strategy) together with substantial engineering effort, HRT improves inference speed by 80\% while achieving translation performance equivalent to its same-capacity AT counterpart. Our fastest system reaches 6k+ words/second in the GPU latency setting, estimated to be about 3.1x faster than last year's winner.
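To make the two-stage decoding paradigm concrete, the following is a minimal sketch of hybrid-regressive decoding as described above: an autoregressive pass that emits only every $k$-th target token, followed by a single non-autoregressive pass that fills the skipped positions. The functions `at_step` and `nat_fill` are hypothetical stand-ins (returning dummy tokens) for the actual AT decoder step and NAT decoder, so only the control flow reflects the method.

```python
# Illustrative sketch of two-stage hybrid-regressive decoding (not the authors' code).
# `at_step` and `nat_fill` are hypothetical placeholders for the real AT decoder
# step and NAT decoder; they return dummy tokens so the control flow is runnable.

EOS = "</s>"
SKIP = "<skip>"

def at_step(prefix, src):
    """Hypothetical AT decoder step: predict the next retained token
    given the source and the discontinuous prefix (every k-th position)."""
    return f"tok{len(prefix)}" if len(prefix) < 4 else EOS

def nat_fill(draft, src):
    """Hypothetical NAT decoder: fill all <skip> placeholders in one pass."""
    return [f"fill{i}" if t == SKIP else t for i, t in enumerate(draft)]

def hybrid_regressive_decode(src, k=2, max_len=32):
    # Stage 1 (autoregressive): generate a discontinuous sequence,
    # i.e. one prediction per k target positions.
    anchors = []
    while len(anchors) * k < max_len:
        tok = at_step(anchors, src)
        anchors.append(tok)
        if tok == EOS:
            break
    # Interleave (k - 1) skip placeholders before each anchor token.
    draft = []
    for tok in anchors:
        draft.extend([SKIP] * (k - 1) + [tok])
    # Stage 2 (non-autoregressive): fill every skipped position at once.
    return nat_fill(draft, src)

print(hybrid_regressive_decode("ein beispiel", k=2))
```

With $k=2$ the autoregressive pass runs for roughly half as many steps as standard AT decoding, which is the source of the speed/quality trade-off controlled by $k$.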