Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models. However, there is a discrepancy in low-frequency words between the distilled and the original data, which leads to more errors in predicting low-frequency words. To alleviate this problem, we directly expose the raw data to NAT models by leveraging pretraining. By analyzing directed alignments, we find that KD makes low-frequency source words align with their targets more deterministically, but fails to align sufficiently many low-frequency words from target to source. Accordingly, we propose reverse KD to rejuvenate more alignments for low-frequency target words. To make the most of authentic and synthetic data, we combine these two complementary approaches into a new training strategy that further boosts NAT performance. We conduct experiments on five translation benchmarks with two advanced NAT architectures. Results demonstrate that the proposed approach significantly and universally improves translation quality by reducing translation errors on low-frequency words. Encouragingly, our approach achieves 28.2 and 33.9 BLEU points on the WMT14 English-German and WMT16 Romanian-English datasets, respectively. Our code, data, and trained models are available at \url{https://github.com/alphadl/RLFW-NAT}.
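To make the three data views mentioned above concrete, the following is a minimal, self-contained sketch of how authentic (raw), distilled (KD), and reverse-distilled (reverse KD) parallel data could be assembled; the teacher functions, helper names, and the pretrain-then-expose schedule are illustrative assumptions rather than the released implementation (see the repository linked above for the authors' code).

\begin{verbatim}
# A minimal sketch (not the authors' code) of the three parallel-data views
# described in the abstract. The "teachers" here are placeholders; in practice
# they would be autoregressive NMT models trained on the raw parallel data.

from typing import Callable, List, Tuple

Pair = Tuple[str, str]

def build_data_views(
    raw_pairs: List[Pair],
    fwd_teacher: Callable[[str], str],  # source -> distilled target (standard KD)
    bwd_teacher: Callable[[str], str],  # target -> distilled source (reverse KD)
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    # Authentic parallel data: keep both sides raw.
    raw = list(raw_pairs)
    # Standard KD: keep authentic sources, replace targets with teacher outputs,
    # which aligns low-frequency source words more deterministically.
    kd = [(src, fwd_teacher(src)) for src, _ in raw_pairs]
    # Reverse KD: keep authentic targets, replace sources, so low-frequency
    # target words regain alignments to some source word.
    rev_kd = [(bwd_teacher(tgt), tgt) for _, tgt in raw_pairs]
    return raw, kd, rev_kd

if __name__ == "__main__":
    toy = [("ein kleines Haus", "a small house")]
    # Identity functions stand in for real AT teachers in this toy example.
    raw, kd, rev_kd = build_data_views(toy, lambda s: s, lambda t: t)
    print(len(raw), len(kd), len(rev_kd))
\end{verbatim}

Under this sketch, one schedule consistent with the abstract would pretrain the NAT model on the synthetic views (kd and rev_kd) and then expose it to the raw view; the exact combination strategy is detailed in the paper and repository.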