The successes of deep learning, variational inference, and many other fields have been aided by specialized implementations of reverse-mode automatic differentiation (AD) to compute gradients of mega-dimensional objectives. The AD techniques underlying these tools were designed to compute exact gradients to numerical precision, but modern machine learning models are almost always trained with stochastic gradient descent. Why spend computation and memory on exact (minibatch) gradients only to use them for stochastic optimization? We develop a general framework and approach for randomized automatic differentiation (RAD), which can allow unbiased gradient estimates to be computed with reduced memory in return for variance. We examine limitations of the general approach, and argue that we must leverage problem specific structure to realize benefits. We develop RAD techniques for a variety of simple neural network architectures, and show that for a fixed memory budget, RAD converges in fewer iterations than using a small batch size for feedforward networks, and in a similar number for recurrent networks. We also show that RAD can be applied to scientific computing, and use it to develop a low-memory stochastic gradient method for optimizing the control parameters of a linear reaction-diffusion PDE representing a fission reactor.
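As a rough illustration of the idea described above (not the paper's actual algorithm), the sketch below shows how randomly sparsifying an intermediate Jacobian and reweighting the kept entries yields an unbiased gradient estimate whose memory footprint shrinks as the keep probability drops, at the cost of added variance. The toy two-layer linear model, the helper `rad_grad`, and the parameter `keep_prob` are hypothetical names introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer linear map: y = B @ (A @ x), loss L = v . y.
# The exact gradient with respect to x is A.T @ B.T @ v.
A = rng.normal(size=(50, 50))
B = rng.normal(size=(50, 50))
v = rng.normal(size=50)

exact_grad = A.T @ (B.T @ v)

def rad_grad(keep_prob=0.1):
    """Unbiased gradient estimate: keep each entry of the intermediate
    Jacobian B with probability keep_prob and rescale by 1/keep_prob,
    so only the sampled entries would need to be stored for the
    reverse pass. E[B_sparse] = B, hence the estimator is unbiased."""
    mask = rng.random(B.shape) < keep_prob
    B_sparse = np.where(mask, B / keep_prob, 0.0)
    return A.T @ (B_sparse.T @ v)

# Averaging many randomized estimates recovers the exact gradient,
# illustrating the memory-for-variance trade-off.
est = np.mean([rad_grad() for _ in range(2000)], axis=0)
print(np.linalg.norm(est - exact_grad) / np.linalg.norm(exact_grad))
```

In practice a stochastic optimizer consumes each noisy estimate directly rather than averaging them; the averaging here only serves to check unbiasedness.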