Modern neural network architectures typically have many millions of parameters and can be pruned substantially without significant loss in effectiveness, which demonstrates that they are over-parameterized. The contribution of this work is two-fold. The first is a method for approximating a multivariate Bernoulli random variable by means of a deterministic and differentiable transformation of any real-valued multivariate random variable. The second is a method for model selection by element-wise multiplication of parameters with approximate binary gates that may be computed deterministically or stochastically and that can take on exact zero values. Sparsity is encouraged by including a differentiable surrogate of the $L_0$ norm as a regularization term in the loss. Since the method is differentiable, it enables straightforward and efficient learning of model architectures via empirical risk minimization with stochastic gradient descent, and in principle allows conditional computation during training. The method also supports arbitrary group sparsity over parameters or activations, and therefore offers a framework for both unstructured and flexibly structured model pruning. Finally, experiments demonstrate the effectiveness of the proposed approach.
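As a concrete illustration of the idea, the sketch below shows one possible instantiation of such approximately binary gates in PyTorch: a stretched and rectified sigmoid of a logistic sample (a deterministic, differentiable transform of a real-valued random variable) together with a differentiable surrogate for the expected $L_0$ norm. This is a minimal sketch under assumed conventions; the class name `L0Gate` and the hyper-parameters `beta`, `gamma`, and `zeta` are illustrative choices, not a reference implementation of the method described above.

```python
import math

import torch
import torch.nn as nn


class L0Gate(nn.Module):
    """Approximately binary, differentiable gates with a surrogate L0 penalty.

    Illustrative sketch only: the gate construction (stretched, rectified
    sigmoid of a logistic sample) and all hyper-parameter values are
    assumptions, not the original work's reference code.
    """

    def __init__(self, n_gates, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
        super().__init__()
        # Location parameter of each gate's underlying distribution.
        self.log_alpha = nn.Parameter(torch.zeros(n_gates))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        if self.training:
            # Stochastic gate: a deterministic, differentiable transform of a
            # logistic sample (reparameterization trick).
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid(
                (torch.log(u) - torch.log(1 - u) + self.log_alpha) / self.beta
            )
        else:
            # Deterministic gate at evaluation time.
            s = torch.sigmoid(self.log_alpha)
        # Stretch beyond [0, 1] and rectify, so gates reach exact zeros (and ones).
        s = s * (self.zeta - self.gamma) + self.gamma
        return s.clamp(0.0, 1.0)

    def l0_penalty(self):
        # Differentiable surrogate for the expected number of non-zero gates.
        shift = self.beta * math.log(-self.gamma / self.zeta)
        return torch.sigmoid(self.log_alpha - shift).sum()
```

In use, the gates returned by `forward()` would be multiplied element-wise with a layer's parameters (or shared across rows, channels, or other groups for structured pruning), and `l0_penalty()` would be added to the training loss with a regularization weight, so that both the parameters and the effective architecture are learned jointly by stochastic gradient descent.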