Forward gradient learning computes a noisy directional gradient and is a biologically plausible alternative to backprop for learning deep neural networks. However, the standard forward gradient algorithm, when applied naively, suffers from high variance when the number of parameters to be learned is large. In this paper, we propose a series of architectural and algorithmic modifications that together make forward gradient learning practical for standard deep learning benchmark tasks. We show that it is possible to substantially reduce the variance of the forward gradient estimator by applying perturbations to activations rather than weights. We further improve the scalability of forward gradient by introducing a large number of local greedy loss functions, each of which involves only a small number of learnable parameters, and a new MLPMixer-inspired architecture, LocalMixer, that is more suitable for local learning. Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
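To make the two estimators discussed above concrete, the following is a minimal sketch (not the authors' released code) of weight-perturbed versus activity-perturbed forward gradients for a single linear layer with a squared-error loss, written with JAX's forward-mode `jax.jvp`. The function names `weight_perturbed_grad` and `activity_perturbed_grad` are illustrative.

```python
import jax
import jax.numpy as jnp

def loss_from_weights(W, x, y):
    # Toy objective: L(W) = mean((x @ W - y)^2)
    return jnp.mean((x @ W - y) ** 2)

def weight_perturbed_grad(W, x, y, key):
    # Sample a random direction v in weight space and use forward-mode AD
    # (jvp) to get the directional derivative (dL/dW . v), then scale v by it.
    v = jax.random.normal(key, W.shape)
    _, dL_dv = jax.jvp(lambda w: loss_from_weights(w, x, y), (W,), (v,))
    # Unbiased estimate of dL/dW; its variance grows with the number of weights.
    return dL_dv * v

def activity_perturbed_grad(W, x, y, key):
    # Perturb the pre-activations z = x @ W instead of W. The random direction
    # now lives in activation space (batch x out_dim), which is typically much
    # smaller than weight space, so the gradient estimate has lower variance.
    z = x @ W
    u = jax.random.normal(key, z.shape)
    _, dL_du = jax.jvp(lambda a: jnp.mean((a - y) ** 2), (z,), (u,))
    g_z = dL_du * u      # estimated dL/dz
    return x.T @ g_z     # exact local chain rule through z = x @ W

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 64))
W = 0.1 * jax.random.normal(jax.random.PRNGKey(1), (64, 10))
y = jax.random.normal(jax.random.PRNGKey(2), (32, 10))
print(weight_perturbed_grad(W, x, y, key).shape,
      activity_perturbed_grad(W, x, y, key).shape)
```

In both cases the estimate is unbiased, since E[(g . u) u] = g for u drawn from a standard normal; the difference is the dimensionality of the perturbed space, which is what drives the variance reduction the abstract refers to.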