Innovations in neural architectures have fostered significant breakthroughs in language modeling and computer vision. Unfortunately, novel architectures often lead to challenging hyperparameter choices and training instability if the network parameters are not properly initialized. A number of architecture-specific initialization schemes have been proposed, but these schemes are not always portable to new architectures. This paper presents GradInit, an automated and architecture-agnostic method for initializing neural networks. GradInit is based on a simple heuristic: the norm of each network layer is adjusted so that a single step of SGD or Adam with prescribed hyperparameters results in the smallest possible loss value. This adjustment is done by introducing a scalar multiplier variable in front of each parameter block, and then optimizing these variables using a simple numerical scheme. GradInit accelerates the convergence and improves the test performance of many convolutional architectures, both with and without skip connections, and even without normalization layers. It also improves the stability of the original Transformer architecture for machine translation, enabling it to be trained without learning rate warmup using either Adam or SGD under a wide range of learning rates and momentum coefficients. Code is available at https://github.com/zhuchen03/gradinit.
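To make the heuristic concrete, the sketch below illustrates one possible way to implement it in PyTorch: each parameter block is rescaled by a learnable scalar, a single SGD step with a prescribed learning rate is simulated, and the post-step loss is minimized with respect to the scalars. This is a minimal illustration under assumed names (`model`, `loss_fn`, `batch1`, `batch2`, `eta`, `meta_lr`), not the authors' implementation; see the linked repository for the actual code.

```python
import torch
from torch.func import functional_call


def make_scales(model):
    # One learnable scalar multiplier per parameter block, initialized to 1.
    return {name: torch.ones((), requires_grad=True)
            for name, _ in model.named_parameters()}


def gradinit_step(model, loss_fn, batch1, batch2, scales, eta=0.1, meta_lr=1e-3):
    """One update of the per-block scale factors (illustrative sketch)."""
    params = dict(model.named_parameters())
    # Rescale every parameter block by its scalar multiplier.
    scaled = {name: scales[name] * p for name, p in params.items()}

    # Loss and gradients at the rescaled initialization.
    x1, y1 = batch1
    loss1 = loss_fn(functional_call(model, scaled, (x1,)), y1)
    grads = torch.autograd.grad(loss1, list(scaled.values()), create_graph=True)

    # Simulate a single SGD step with the prescribed learning rate eta.
    stepped = {name: w - eta * g
               for (name, w), g in zip(scaled.items(), grads)}

    # Loss after the simulated step: the quantity the scales are tuned to minimize.
    x2, y2 = batch2
    loss2 = loss_fn(functional_call(model, stepped, (x2,)), y2)

    # Gradient step on the scalar multipliers only; the weights themselves stay fixed.
    scale_grads = torch.autograd.grad(loss2, list(scales.values()))
    with torch.no_grad():
        for s, g in zip(scales.values(), scale_grads):
            s -= meta_lr * g
    return loss2.item()
```

Running `gradinit_step` for a number of iterations before regular training, and then multiplying each parameter block by its learned scale, would correspond to the initialization adjustment described above; the published method additionally handles Adam-style steps and constrains the scales, which this sketch omits.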