Knowledge distillation has been proven effective for model acceleration and compression. It allows a small network to learn to generalize in the same way as a large network. Recent successes in pre-training suggest the effectiveness of transferring model parameters. Inspired by this, we investigate methods of model acceleration and compression in another line of research. We propose Weight Distillation, which transfers the knowledge in the parameters of a large network to a small network through a parameter generator. Our experiments on the WMT16 En-Ro, NIST12 Zh-En, and WMT14 En-De machine translation tasks show that weight distillation can train a small network that is 1.88x to 2.94x faster than the large network while achieving competitive performance. With a small network of the same size, weight distillation outperforms knowledge distillation by 0.51 to 1.82 BLEU points.
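For intuition, below is a minimal PyTorch sketch of what a parameter generator might look like: a learned mapping from a teacher layer's weight matrix to a smaller student layer's weight matrix. The class name ParamGenerator, the projection-based parameterization, and the dimensions are illustrative assumptions, not the paper's exact construction.

```python
import torch
import torch.nn as nn

class ParamGenerator(nn.Module):
    """Hypothetical parameter generator: maps a teacher weight matrix
    of shape (d_t_out, d_t_in) to a smaller student weight matrix of
    shape (d_s_out, d_s_in) via two learned projections. Illustrative
    only; the paper's generator may be parameterized differently."""

    def __init__(self, d_t_out, d_t_in, d_s_out, d_s_in):
        super().__init__()
        # Learned projections over the teacher weight's rows and columns.
        self.row_proj = nn.Parameter(torch.randn(d_s_out, d_t_out) * 0.02)
        self.col_proj = nn.Parameter(torch.randn(d_t_in, d_s_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(d_s_out, d_s_in))

    def forward(self, teacher_weight):
        # teacher_weight: (d_t_out, d_t_in) -> student weight: (d_s_out, d_s_in)
        return self.row_proj @ teacher_weight @ self.col_proj + self.bias


# Usage sketch: generate a student layer's weight from a frozen teacher layer's weight.
teacher_fc = nn.Linear(1024, 1024)
generator = ParamGenerator(d_t_out=1024, d_t_in=1024, d_s_out=512, d_s_in=512)
student_weight = generator(teacher_fc.weight.detach())  # shape (512, 512)
```

In this sketch, the generator (not the student weights themselves) is trained, so knowledge encoded in the teacher's parameters can shape the smaller student network.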