It is well established that increasing scale in deep transformer networks leads to improved quality and performance. However, this increase in scale often comes with an increase in compute cost and inference latency. Consequently, research into methods that realize the benefits of increased scale without raising compute cost becomes important. We introduce Alternating Updates (AltUp), a simple-to-implement method to increase a model's capacity without the computational burden. AltUp enables widening the learned representation without increasing computation time by working on a subblock of the representation at each layer. Our experiments on various transformer models and language tasks demonstrate the consistent effectiveness of alternating updates across a diverse set of benchmarks. Finally, we present extensions of AltUp to the sequence dimension, and demonstrate how AltUp can be synergistically combined with existing approaches, such as Sparse Mixture-of-Experts models, to obtain efficient models with even higher capacity.
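To make the core idea concrete, the sketch below illustrates one possible reading of "working on a subblock of the representation at each layer": the token representation is widened into K sub-blocks, the expensive layer is applied to a single activated sub-block, and the remaining sub-blocks are updated with cheap scalar mixing. This is only a minimal illustration, not the authors' implementation; the toy `transformer_layer` and the mixing parameters `p_pred` and `g_corr` are hypothetical stand-ins.

```python
# Minimal sketch of a predict / compute / correct step in the spirit of AltUp.
# Assumptions (not from the source): a generic per-token layer function and
# hypothetical scalar mixing weights p_pred (prediction) and g_corr (correction).
import numpy as np

rng = np.random.default_rng(0)

d_model = 8    # width of one sub-block (what the expensive layer actually sees)
K = 2          # widening factor: K sub-blocks per token
seq_len = 4

# Toy stand-in for an expensive transformer layer (attention + MLP).
W = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
def transformer_layer(x):                      # x: (seq_len, d_model)
    return np.tanh(x @ W)

# Hypothetical "learned" scalars (random here) for prediction and correction.
p_pred = np.eye(K) + 0.1 * rng.standard_normal((K, K))   # (K, K)
g_corr = 1.0 + 0.1 * rng.standard_normal(K)               # (K,)

def altup_step(blocks, activated=0):
    """blocks: list of K arrays, each of shape (seq_len, d_model)."""
    # 1) Predict every sub-block as a scalar mixture of all sub-blocks (cheap).
    predicted = [sum(p_pred[i, j] * blocks[j] for j in range(K)) for i in range(K)]
    # 2) Run the expensive layer on the activated sub-block only.
    computed = transformer_layer(blocks[activated])
    # 3) Correct every prediction with the scaled prediction error
    #    observed on the activated sub-block.
    delta = computed - predicted[activated]
    return [predicted[i] + g_corr[i] * delta for i in range(K)]

blocks = [rng.standard_normal((seq_len, d_model)) for _ in range(K)]
blocks = altup_step(blocks, activated=0)
print([b.shape for b in blocks])   # K sub-blocks updated, layer cost unchanged
```

Under this reading, the per-layer compute stays that of a width-`d_model` layer even though the maintained representation is `K` times wider, which matches the capacity-without-extra-compute claim in the abstract.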