Sequence-to-sequence learning models still require several days to reach state-of-the-art performance on large benchmark datasets using a single machine. This paper shows that reduced precision and large batch training can speed up training by nearly 5x on a single 8-GPU machine with careful tuning and implementation. On WMT'14 English-German translation, we match the accuracy of Vaswani et al. (2017) in under 5 hours when training on 8 GPUs, and we obtain a new state of the art of 29.3 BLEU after training for 91 minutes on 128 GPUs. We further improve these results to 29.8 BLEU by training on the much larger Paracrawl dataset.
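The two techniques named here can be illustrated with a minimal sketch: 16-bit (half-precision) compute with dynamic loss scaling, and a larger effective batch built by accumulating gradients over several micro-batches before each optimizer step. The sketch below uses PyTorch's AMP utilities rather than the paper's own fairseq implementation; the model, loss, and ACCUM_STEPS value are placeholders, not the paper's configuration.

    # Sketch: reduced-precision training plus gradient accumulation.
    # Assumes a CUDA device; model and loss are stand-ins.
    import torch

    ACCUM_STEPS = 16  # accumulate to simulate a 16x larger batch

    model = torch.nn.Linear(512, 512).cuda()   # placeholder for the Transformer
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
    scaler = torch.cuda.amp.GradScaler()       # dynamic loss scaling for FP16

    def train_epoch(batches):
        optimizer.zero_grad()
        for i, (x, y) in enumerate(batches):
            with torch.cuda.amp.autocast():    # forward pass in half precision
                loss = torch.nn.functional.mse_loss(model(x), y)
            # Scale the loss so FP16 gradients do not underflow, and
            # average over the accumulated micro-batches.
            scaler.scale(loss / ACCUM_STEPS).backward()
            if (i + 1) % ACCUM_STEPS == 0:
                scaler.step(optimizer)         # unscales grads, then steps
                scaler.update()                # adapt the loss scale
                optimizer.zero_grad()

Accumulating over ACCUM_STEPS micro-batches gives the optimizer the gradient of a batch ACCUM_STEPS times larger without requiring more GPU memory per step, which is one way a single 8-GPU machine can emulate large-batch training.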