Alternatives to backpropagation have long been studied to better understand how biological brains may learn. Recently, they have also garnered interest as a way to train neural networks more efficiently. By relaxing constraints inherent to backpropagation (e.g., symmetric feedforward and feedback weights, sequential updates), these methods open up promising prospects, such as local learning. However, the tradeoffs between different methods in terms of final task performance, convergence speed, and ultimately compute and data requirements are rarely outlined. In this work, we use scaling laws to study the ability of Direct Feedback Alignment~(DFA) to train causal decoder-only Transformers efficiently. Scaling laws provide an overview of the tradeoffs implied by a modeling decision, up to extrapolating how it might transfer to increasingly large models. We find that DFA fails to offer more efficient scaling than backpropagation: there is no regime in which the degradation in loss incurred by using DFA is worth the potential reduction in compute budget. Our finding is at variance with previous beliefs in the alternative training methods community, and highlights the need for holistic empirical approaches to better understand modeling decisions.
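For readers unfamiliar with DFA, a minimal sketch of how it relaxes the constraints mentioned above (following Nøkland's original formulation; the notation here is illustrative and not drawn from the paper body): backpropagation transports the output error backwards through the transposed forward weights, layer by layer, whereas DFA projects the output error $e$ directly to each hidden layer through a fixed random feedback matrix $B_\ell$:
\[
\delta_\ell^{\mathrm{BP}} = \big(W_{\ell+1}^{\top}\,\delta_{\ell+1}\big) \odot f'(a_\ell),
\qquad
\delta_\ell^{\mathrm{DFA}} = \big(B_\ell\, e\big) \odot f'(a_\ell),
\]
where $a_\ell$ denotes the pre-activations at layer $\ell$ and $f'$ the derivative of the nonlinearity. Because $B_\ell$ is fixed and independent of the forward weights, DFA removes the weight-symmetry requirement, and because each $\delta_\ell^{\mathrm{DFA}}$ depends only on the global error $e$ and local quantities, all layer updates can be computed in parallel once $e$ is available, rather than sequentially.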