Non-autoregressive neural machine translation (NAT) typically employs sequence-level knowledge distillation with an autoregressive neural machine translation (AT) model as the teacher. However, a NAT model often outputs shorter sentences than an AT model. In this work, we propose sequence-level knowledge distillation (SKD) using perturbed length-aware positional encoding and apply it to a student model, the Levenshtein Transformer. Our method outperformed a standard Levenshtein Transformer by up to 2.5 bilingual evaluation understudy (BLEU) points on WMT14 German-to-English translation, and the resulting NAT model output longer sentences than the baseline NAT models.
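The abstract does not spell out the encoding itself, so the following is only a minimal illustrative sketch in Python/NumPy. It assumes that the length-aware positional encoding rescales standard sinusoidal positions according to a target length that is randomly perturbed by a small offset; the function names, the perturbation range `max_perturb`, and the linear rescaling are assumptions, not the paper's exact formulation.

```python
import numpy as np

def sinusoidal_pe(positions, d_model):
    """Standard sinusoidal positional encoding for the given (possibly fractional) positions."""
    pe = np.zeros((len(positions), d_model))
    div = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(np.outer(positions, div))
    pe[:, 1::2] = np.cos(np.outer(positions, div))
    return pe

def perturbed_length_aware_pe(seq_len, d_model, max_perturb=2, rng=None):
    """Illustrative length-aware PE with length perturbation (an assumption, not the paper's exact method).

    The true length `seq_len` is shifted by a random offset in [-max_perturb, max_perturb],
    and token positions are linearly rescaled to the perturbed length before the
    usual sinusoidal encoding is applied.
    """
    rng = rng or np.random.default_rng()
    perturbed_len = max(1, seq_len + rng.integers(-max_perturb, max_perturb + 1))
    # Map positions 0..seq_len-1 onto the range 0..perturbed_len-1.
    positions = np.arange(seq_len) * (perturbed_len - 1) / max(seq_len - 1, 1)
    return sinusoidal_pe(positions, d_model)

# Example: encode a 10-token target with positions perturbed around its true length.
pe = perturbed_length_aware_pe(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)
```

The intent of such a perturbation during distillation-data generation would be to discourage the student from tying its length prediction too tightly to the teacher's positions, which is consistent with the abstract's observation that the resulting NAT model produces longer outputs.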