This paper describes the submissions of the NiuTrans Team to the WNGT 2020 Efficiency Shared Task. We focus on the efficient implementation of deep Transformer models \cite{wang-etal-2019-learning, li-etal-2019-niutrans} using NiuTensor (https://github.com/NiuTrans/NiuTensor), a flexible toolkit for NLP tasks. We explore the combination of a deep encoder and a shallow decoder in Transformer models via model compression and knowledge distillation. Decoding also benefits from FP16 inference, attention caching, dynamic batching, and batch pruning. Our systems achieve promising results in both translation quality and efficiency, e.g., our fastest system can translate more than 40,000 tokens per second with an RTX 2080 Ti while maintaining 42.9 BLEU on \textit{newstest2018}. The code, models, and docker images are available at NiuTrans.NMT (https://github.com/NiuTrans/NiuTrans.NMT).
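To make the attention-caching idea concrete, the following minimal sketch (hypothetical names, not NiuTensor's actual API) shows the standard trick for autoregressive decoding: each step projects only the newest token and appends its key/value to a cache, so earlier positions are never recomputed.

\begin{verbatim}
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

class CachedSelfAttention:
    """Incremental decoder self-attention with a key/value cache.

    Illustrative sketch only; W_q, W_k, W_v are per-layer
    projection matrices (assumed square here for simplicity).
    """
    def __init__(self, W_q, W_k, W_v):
        self.W_q, self.W_k, self.W_v = W_q, W_k, W_v
        self.K_cache = []  # keys of all previous decoding steps
        self.V_cache = []  # values of all previous decoding steps

    def step(self, x):
        # Project only the new token's hidden state x,
        # then attend over the cached keys/values plus the new one.
        q = x @ self.W_q
        self.K_cache.append(x @ self.W_k)
        self.V_cache.append(x @ self.W_v)
        K = np.stack(self.K_cache)
        V = np.stack(self.V_cache)
        return attend(q, K, V)

# Usage: three decoding steps with random weights and inputs.
d = 4
rng = np.random.default_rng(0)
attn = CachedSelfAttention(*(rng.normal(size=(d, d)) for _ in range(3)))
for _ in range(3):
    out = attn.step(rng.normal(size=d))
\end{verbatim}

The same caching pattern applies per layer and per attention head; combined with batch pruning (dropping finished sentences from the batch), it keeps per-step decoding cost proportional to the number of unfinished hypotheses.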