We introduce a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters. DeLighT more efficiently allocates parameters both (1) within each Transformer block using the DeLighT transformation, a deep and light-weight transformation, and (2) across blocks using block-wise scaling, which allows for shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks are 2.5 to 4 times deeper than standard transformer models and yet have fewer parameters and operations. Experiments on benchmark machine translation and language modeling tasks show that DeLighT matches or improves the performance of baseline Transformers with 2 to 3 times fewer parameters on average. Our source code is available at: \url{https://github.com/sacmehta/delight}
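To make the block-wise scaling idea concrete, the following is a minimal, hypothetical sketch of how per-block depth could be assigned so that blocks are shallow near the input and deep near the output. The function name, parameters, and the linear-interpolation schedule are illustrative assumptions, not the exact recipe from the paper or repository.

\begin{verbatim}
# Hypothetical sketch of block-wise scaling: block depth grows
# roughly linearly from the input side to the output side of the
# network. The actual DeLighT schedule may differ; see the paper.

def blockwise_depths(num_blocks, min_depth, max_depth):
    """Assign a depth to each block: shallow near the input,
    deep near the output (linear interpolation, rounded)."""
    if num_blocks == 1:
        return [max_depth]
    return [
        round(min_depth + (max_depth - min_depth) * b / (num_blocks - 1))
        for b in range(num_blocks)
    ]

print(blockwise_depths(num_blocks=6, min_depth=4, max_depth=8))
# -> [4, 5, 6, 6, 7, 8]
\end{verbatim}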