Training sparse networks to converge to the same performance as dense neural architectures has proven to be elusive. Recent work suggests that initialization is the key. However, while this direction of research has had some success, focusing on initialization alone appears to be inadequate. In this paper, we take a broader view of training sparse networks and consider the role of regularization, optimization, and architecture choices on sparse models. We propose a simple experimental framework, Same Capacity Sparse vs Dense Comparison (SC-SDC), that allows for fair comparison of sparse and dense networks. Furthermore, we propose a new measure of gradient flow, Effective Gradient Flow (EGF), that correlates better with performance in sparse networks. Using top-line metrics, SC-SDC and EGF, we show that default choices of optimizers, activation functions and regularizers used for dense networks can disadvantage sparse networks. Based on these findings, we show that gradient flow in sparse networks can be improved by reconsidering aspects of the architecture design and the training regime. Our work suggests that initialization is only one piece of the puzzle and that taking a wider view of tailoring optimization to sparse networks yields promising results.