Sparse training is a natural way to accelerate the training of deep neural networks and reduce memory usage, especially since large modern neural networks are significantly over-parameterized. However, most existing methods cannot achieve this goal in practice, because the chain-rule-based gradient estimators (with respect to the structure parameters) they adopt require dense computation in at least the backward propagation step. This paper solves this problem by proposing an efficient sparse training method with completely sparse forward and backward passes. We first formulate the training process as a continuous minimization problem under a global sparsity constraint. We then separate the optimization process into two steps, corresponding to the weight update and the structure parameter update. For the former step, we use the conventional chain rule, which can be made sparse by exploiting the sparse structure. For the latter step, instead of the chain-rule-based gradient estimators used in existing methods, we propose a variance-reduced policy gradient estimator that requires only two forward passes and no backward propagation, thus achieving completely sparse training. We prove that the variance of our gradient estimator is bounded. Extensive experimental results on real-world datasets demonstrate that, compared with previous methods, our algorithm is much more effective at accelerating the training process, by up to an order of magnitude.
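To make the structure-parameter step concrete, the following is a minimal sketch (not the paper's implementation) of a score-function (policy) gradient update that uses only two forward evaluations of the masked model and no backward propagation. It assumes Bernoulli masks over the weights parameterized by logits theta, and it uses the second sample's loss as a simple baseline for variance reduction; all names (forward_loss, policy_grad_step) and the toy model are illustrative assumptions, and the paper's actual estimator and variance-reduction scheme may differ.

```python
# Sketch: structure-parameter update via a score-function (REINFORCE-style)
# gradient, using two forward passes only. Assumptions (not from the paper):
# Bernoulli masks m ~ Bern(sigmoid(theta)) over the weights, and a two-sample
# difference baseline for variance reduction.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_loss(W, mask, x, y):
    # Forward pass of a tiny linear model with a sparse weight mask (no backprop).
    pred = x @ (W * mask)
    return float(np.mean((pred - y) ** 2))

def policy_grad_step(theta, W, x, y, lr=0.1):
    """One update of the structure logits theta using two forward passes.

    Gradient of E_m[L(W * m)] w.r.t. theta via the score-function identity,
    with the independent second sample's loss as a baseline:
        g = (L(m1) - L(m2)) * d/dtheta log p(m1 | theta)
    For Bernoulli(sigmoid(theta)), d/dtheta log p(m1 | theta) = m1 - sigmoid(theta).
    """
    p = sigmoid(theta)
    m1 = (rng.random(theta.shape) < p).astype(float)
    m2 = (rng.random(theta.shape) < p).astype(float)
    l1 = forward_loss(W, m1, x, y)   # forward pass 1
    l2 = forward_loss(W, m2, x, y)   # forward pass 2 (baseline)
    score = m1 - p
    g = (l1 - l2) * score
    return theta - lr * g

# Toy usage: learn which of 8 weights to keep for a 1-d regression target.
d = 8
W = rng.normal(size=(d, 1))
x = rng.normal(size=(256, d))
y = x[:, :2] @ W[:2]                 # only the first two weights matter
theta = np.zeros((d, 1))
for _ in range(2000):
    theta = policy_grad_step(theta, W, x, y)
print("keep probabilities:", sigmoid(theta).ravel().round(2))
```

Because the baseline sample is drawn independently of the first, the estimator stays unbiased while its variance is reduced; crucially, neither forward pass requires storing activations for backpropagation, so both can exploit the sparse mask end to end.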