Network sparsity has gained popularity mostly for its ability to reduce network complexity. Extensive studies have explored gradient-driven sparsity. Typically, these methods are built on the premise of weight independence, which contradicts the fact that weights mutually influence one another; consequently, their performance leaves room for improvement. In this paper, we propose to further optimize gradient-driven sparsity (OptG) by solving this independence paradox. Our motivation comes from recent advances in supermask training, which show that sparse subnetworks can be located in a randomly initialized network simply by updating mask values without modifying any weight. We prove that supermask training in effect accumulates the weight gradients and can partly solve the independence paradox. Consequently, OptG integrates supermask training into gradient-driven sparsity, and a specialized mask optimizer is designed to solve the independence paradox. Experiments show that OptG clearly surpasses many existing state-of-the-art competitors. Our code is available at \url{https://github.com/zyxxmu/OptG}.
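To see why supermask training can be read as accumulating weight gradients, consider a minimal sketch of the common straight-through formulation; the notation below ($s_i$, $m_i$, $g_i^{(t)}$, $\eta$, $\tau$) is illustrative and not taken from the paper. With a frozen weight $w_i$, a trainable score $s_i$, and a binary mask $m_i = \mathbb{1}[s_i \geq \tau]$, the forward pass uses $w_i m_i$ and the straight-through estimator passes the gradient of the masked weight directly to the score, so after $T$ SGD steps
\[
s_i^{(T)} \;=\; s_i^{(0)} \;-\; \eta \sum_{t=1}^{T} \frac{\partial \mathcal{L}^{(t)}}{\partial (w_i m_i)}\, w_i
\;=\; s_i^{(0)} \;-\; \eta\, w_i \sum_{t=1}^{T} g_i^{(t)},
\qquad
g_i^{(t)} := \frac{\partial \mathcal{L}^{(t)}}{\partial (w_i m_i)} .
\]
Under this sketch, each score aggregates that weight's gradients over many iterations rather than relying on a single-step gradient, which is the sense in which supermask training partly relaxes the weight-independence assumption of typical gradient-driven criteria.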