The Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and its support for parallel training of sequence generation. However, deploying Transformers is challenging because different scenarios require models of different complexities and scales. Naively training multiple Transformers is redundant in terms of both computation and memory. In this paper, we propose novel scalable Transformers, which naturally contain sub-Transformers of different scales that share parameters. Each sub-Transformer can be easily obtained by cropping the parameters of the largest Transformer. We propose a three-stage training scheme that tackles the difficulty of training the scalable Transformers by introducing additional supervision from word-level and sequence-level self-distillation. Extensive experiments were conducted on WMT En-De and En-Fr to validate our proposed scalable Transformers.
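To make the cropping idea concrete, the following is a minimal sketch rather than the implementation used in the paper. It assumes that a sub-Transformer layer keeps the leading rows and columns of the largest layer's weight matrix, and that word-level self-distillation is a KL divergence between the output distributions of the largest model (teacher) and a sub-model (student). The abstract specifies neither choice, and all function names here are hypothetical.

```python
import math
import random


def make_matrix(rows, cols, seed=0):
    """Random weight matrix as a list of rows (illustrative initialisation only)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def crop_linear(weight, bias, out_dim, in_dim):
    """Assumed cropping rule: keep the top-left out_dim x in_dim block.

    Plain Python slicing copies the values; a tensor framework would
    normally use views so the sub-layer and the full layer share storage.
    """
    if out_dim > len(weight) or in_dim > len(weight[0]):
        raise ValueError("sub-layer cannot be larger than the full layer")
    return [row[:in_dim] for row in weight[:out_dim]], bias[:out_dim]


def linear(weight, bias, x):
    """Compute y = W x + b for a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def word_level_distillation(teacher_logits, student_logits):
    """Assumed loss: KL(teacher || student) over the vocabulary at one position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Largest output layer: hidden size 8 -> toy vocabulary of 5 tokens.
full_w = make_matrix(5, 8, seed=1)
full_b = [0.0] * 5

# Sub-layer with hidden size 4, obtained by cropping the input dimension.
sub_w, sub_b = crop_linear(full_w, full_b, out_dim=5, in_dim=4)

hidden_full = [0.1 * i for i in range(8)]  # toy hidden state of the largest model
hidden_sub = hidden_full[:4]               # toy hidden state of the sub-model

teacher_logits = linear(full_w, full_b, hidden_full)
student_logits = linear(sub_w, sub_b, hidden_sub)
loss = word_level_distillation(teacher_logits, student_logits)
```

In this sketch the sub-layer reuses the first four input columns of the largest layer, so no separate parameters are created for the smaller model. The loss is zero when the student reproduces the teacher's distribution and positive otherwise.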