We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique that shares the parameters of one layer across all layers, as in Universal Transformers (Dehghani et al., 2019), to improve efficiency in computational time. We propose three strategies, Sequence, Cycle, and Cycle (rev), for assigning parameters to each layer. Experimental results show that the proposed strategies are efficient in terms of parameter size and computational time. Moreover, we show that the proposed strategies are also effective in settings with large amounts of training data, such as the recent WMT competition.
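To make the layer-to-parameter assignment concrete, the following is a minimal sketch in PyTorch, assuming M shared parameter sets are reused across N encoder layers. The specific assignment rules shown for Sequence, Cycle, and Cycle (rev) are one plausible interpretation of the strategy names, not necessarily the exact definitions used in the paper, and the class and function names are hypothetical.

```python
import torch.nn as nn


def assign_params(strategy, num_layers, num_shared):
    """Map each of `num_layers` layers to one of `num_shared` parameter sets.

    Assumed assignment rules for the three named strategies:
      - "sequence":  consecutive layers share parameters,   e.g. 0,0,1,1,2,2
      - "cycle":     parameter sets repeat cyclically,       e.g. 0,1,2,0,1,2
      - "cycle_rev": cycle, with the final cycle reversed,   e.g. 0,1,2,2,1,0
    """
    if strategy == "sequence":
        return [i * num_shared // num_layers for i in range(num_layers)]
    if strategy == "cycle":
        return [i % num_shared for i in range(num_layers)]
    if strategy == "cycle_rev":
        ids = [i % num_shared for i in range(num_layers - num_shared)]
        ids += list(reversed(range(num_shared)))  # reverse the last cycle
        return ids
    raise ValueError(f"unknown strategy: {strategy}")


class SharedEncoder(nn.Module):
    """Encoder stack that applies N layer computations with M parameter sets."""

    def __init__(self, d_model=512, nhead=8, num_layers=6, num_shared=3,
                 strategy="cycle"):
        super().__init__()
        # Only `num_shared` distinct parameter sets are instantiated.
        self.shared_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_shared)
        )
        self.assignment = assign_params(strategy, num_layers, num_shared)

    def forward(self, x, mask=None):
        # Run `num_layers` layer applications, reusing the shared parameters
        # according to the chosen assignment strategy.
        for idx in self.assignment:
            x = self.shared_layers[idx](x, src_mask=mask)
        return x
```

Under this sketch, the parameter count grows with the number of shared sets (M) rather than the number of layer applications (N), which is the source of the parameter-size savings mentioned above.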