Among the most successful methods for sparsifying deep (neural) networks are those that adaptively mask the network weights throughout training. By examining this masking, or dropout, in the linear case, we uncover a duality between such adaptive methods and regularization through the so-called "$\eta$-trick" that casts both as iteratively reweighted optimizations. We show that any dropout strategy that adapts to the weights in a monotonic way corresponds to an effective subquadratic regularization penalty, and therefore leads to sparse solutions. We obtain the effective penalties for several popular sparsification strategies, which are remarkably similar to classical penalties commonly used in sparse optimization. Considering variational dropout as a case study, we demonstrate similar empirical behavior between the adaptive dropout method and classical methods on the task of deep network sparsification, validating our theory.
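For readers unfamiliar with the "$\eta$-trick" mentioned above, the display below sketches the standard variational identity it refers to; this illustration is ours and is not taken from the paper's own derivation. The identity rewrites the absolute value as a minimum over quadratics, which is what lets an $\ell_1$-type penalty be handled by iteratively reweighted quadratic steps:
$$
|w| \;=\; \min_{\eta > 0} \left( \frac{w^2}{2\eta} + \frac{\eta}{2} \right),
\qquad \text{with minimizer } \eta^\star = |w|.
$$
Alternately minimizing over the weights $w$ (a weighted quadratic problem) and the auxiliary variables $\eta$ (closed form, $\eta = |w|$) yields an iteratively reweighted optimization, which is the structure the abstract identifies in monotonically adaptive dropout schemes.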