Automated machine learning has been widely explored to reduce the human effort of designing neural architectures and searching for proper hyperparameters. In the domain of neural initialization, however, similar automated techniques have rarely been studied. Most existing initialization methods are handcrafted and highly dependent on specific architectures. In this paper, we propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network. Specifically, GradCosine is the cosine similarity of sample-wise gradients with respect to the initialized parameters. By analyzing the sample-wise optimization landscape, we show that both the training and test performance of a network can be improved by maximizing GradCosine under a gradient norm constraint. Based on this observation, we further propose the Neural Initialization Optimization (NIO) algorithm. Generalizing the sample-wise analysis to the real batch setting, NIO automatically searches for a better initialization at negligible cost relative to training time. With NIO, we improve the classification performance of a variety of neural architectures on CIFAR-10, CIFAR-100, and ImageNet. Moreover, we find that our method can even help to train a large vision Transformer without warmup.
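For intuition, the sketch below shows one way the sample-wise GradCosine could be computed in PyTorch: the average pairwise cosine similarity of per-sample gradients taken at the initialized parameters. This is a minimal illustration under our own assumptions; the function name gradcosine, its arguments, and the per-sample gradient loop are ours, not the paper's batch-wise implementation.

```python
import torch
import torch.nn.functional as F

def gradcosine(model, loss_fn, inputs, targets):
    # Hypothetical helper (names are ours): average pairwise cosine
    # similarity of per-sample gradients at the current parameters.
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for x, y in zip(inputs, targets):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        g = torch.autograd.grad(loss, params)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
    G = F.normalize(torch.stack(grads), dim=1)  # unit-norm each sample gradient
    sim = G @ G.t()                             # pairwise cosine similarities
    n = sim.shape[0]
    # average over off-diagonal pairs (exclude each gradient's self-similarity)
    return (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))

# usage sketch (assumed shapes): inputs (n, C, H, W), targets (n,)
# score = gradcosine(model, F.cross_entropy, inputs, targets)
```

Per the abstract, NIO then searches over the initialization to maximize this quantity while constraining the gradient norm, which keeps the score from being trivially inflated by shrinking or rescaling the gradients.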