The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of AI. Despite the popularity and low cost-per-iteration of traditional Backpropagation via gradient descent, SGD has a prohibitive convergence rate in non-convex settings, both in theory and in practice. To mitigate this cost, recent works have proposed to employ alternative (Newton-type) training methods with much faster convergence rates, albeit with a higher cost-per-iteration. For a typical neural network with $m=\mathrm{poly}(n)$ parameters and an input batch of $n$ datapoints in $\mathbb{R}^d$, the previous work of [Brand, Peng, Song, and Weinstein, ITCS'2021] requires $\sim mnd + n^3$ time per iteration. In this paper, we present a novel training method that requires only $m^{1-\alpha} n d + n^3$ amortized time in the same overparametrized regime, where $\alpha \in (0.01,1)$ is some fixed constant. This method relies on a new and alternative view of neural networks, as a set of binary search trees, where each iteration corresponds to modifying a small subset of the nodes in the tree. We believe this view would have further applications in the design and analysis of DNNs.
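To make the tree-based view concrete, below is a minimal illustrative sketch, not the paper's actual data structure or algorithm: a segment-tree-style aggregation over the $m$ hidden neurons of a two-layer ReLU network, in which each node stores a partial sum of neuron contributions for a fixed datapoint. Under this (assumed) scheme, updating the weights of a single neuron touches only $O(\log m)$ tree nodes instead of recomputing the full forward pass, which is the flavor of per-iteration sparsity the abstract alludes to. The class name `NeuronSegmentTree` and the aggregation layout are hypothetical choices made purely for illustration.

```python
# Minimal, hypothetical sketch (NOT the paper's method): a segment tree over
# the m hidden neurons of a 2-layer ReLU network, maintaining partial sums of
# per-neuron output contributions for one datapoint, so that changing one
# neuron's weights rewrites only O(log m) tree nodes.
import numpy as np

class NeuronSegmentTree:
    def __init__(self, contributions):
        """contributions[i] = current scalar output contribution of neuron i."""
        self.m = len(contributions)
        self.size = 1
        while self.size < self.m:
            self.size *= 2
        self.tree = np.zeros(2 * self.size)
        self.tree[self.size:self.size + self.m] = contributions
        for i in range(self.size - 1, 0, -1):      # build internal sums bottom-up
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, i, new_value):
        """Change neuron i's contribution; only O(log m) nodes are rewritten."""
        pos = self.size + i
        self.tree[pos] = new_value
        pos //= 2
        while pos >= 1:
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def total(self):
        """Aggregate network output for the maintained datapoint."""
        return self.tree[1]

# Toy usage: after a weight update that affects a single neuron, only that
# neuron's leaf (and its ancestors) are refreshed.
rng = np.random.default_rng(0)
m, d = 8, 4
W = rng.standard_normal((m, d))        # hidden-layer weights
a = rng.standard_normal(m)             # output-layer weights
x = rng.standard_normal(d)             # one datapoint
contrib = a * np.maximum(W @ x, 0.0)   # per-neuron ReLU contributions
tree = NeuronSegmentTree(contrib)
assert np.isclose(tree.total(), contrib.sum())

W[3] += 0.1 * rng.standard_normal(d)   # an iteration updates one neuron
tree.update(3, a[3] * max(W[3] @ x, 0.0))
assert np.isclose(tree.total(), (a * np.maximum(W @ x, 0.0)).sum())
```

The design choice being illustrated is that the tree decouples the cost of a weight update from the total number of neurons: a sparse update costs logarithmic work per affected neuron, which is the kind of amortized saving that makes a sub-$m$ per-iteration cost plausible.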