We consider the problem of training a multi-layer over-parametrized neural network to minimize the empirical risk induced by a loss function. In the typical over-parametrized setting, the network width $m$ is much larger than the data dimension $d$ and the number of training samples $n$ (i.e., $m=\mathrm{poly}(n,d)$), which induces a prohibitively large weight matrix $W\in \mathbb{R}^{m\times m}$ per layer. Naively, one has to pay $O(m^2)$ time to read the weight matrix and evaluate the neural network in both the forward and backward computation. In this work, we show how to reduce the training cost per iteration. Specifically, we propose a framework that pays an $O(m^2)$ cost only in the initialization phase and achieves a truly subquadratic cost per iteration in terms of $m$, i.e., $m^{2-\Omega(1)}$ per iteration. To obtain this result, we make use of various techniques, including a shifted-ReLU-based sparsifier, a lazy low-rank maintenance data structure, fast rectangular matrix multiplication, tensor-based sketching techniques, and preconditioning.
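To give intuition for the shifted-ReLU-based sparsifier, the following is a minimal sketch (not the paper's implementation) of how shifting the ReLU threshold makes only a small fraction of neurons fire at random Gaussian initialization, so per-iteration work can be restricted to the active rows of $W$. The width, dimension, and threshold scale below are illustrative assumptions; the actual framework also identifies the active set with dedicated data structures rather than the full matrix-vector product shown here.

```python
import numpy as np

# Illustrative sketch of shifted-ReLU sparsity at random initialization.
m, d = 4096, 64                    # width m >> d in the over-parametrized regime
rng = np.random.default_rng(0)
W = rng.normal(0.0, 1.0, (m, d))   # random Gaussian weights at initialization
x = rng.normal(0.0, 1.0, d)
x /= np.linalg.norm(x)             # unit-norm input, so each pre-activation ~ N(0, 1)

b = np.sqrt(0.4 * np.log(m))       # shift parameter (illustrative scale)
pre = W @ x                        # pre-activations (the framework avoids this full product)
active = pre > b                   # shifted ReLU: sigma_b(t) = max(t - b, 0)

# Only the active rows contribute to the output, so subsequent forward and
# backward computation can touch just these rows instead of all m of them.
y = np.zeros(m)
y[active] = pre[active] - b

print(f"active neurons: {active.sum()} out of {m} ({active.mean():.1%} of the layer)")
```

With these illustrative parameters only a few percent of the $m$ neurons are active, which is the source of the subquadratic per-iteration cost once the active set can be maintained cheaply across iterations.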