We study the gradients of a maxout network with respect to inputs and parameters and obtain bounds on their moments that depend on the architecture and the parameter distribution. We observe that the distribution of the input-output Jacobian depends on the input, which complicates stable parameter initialization. Based on the moments of the gradients, we formulate parameter initialization strategies that avoid vanishing and exploding gradients in wide networks. Experiments with deep fully-connected and convolutional networks show that these strategies improve SGD and Adam training of deep maxout networks. In addition, we obtain refined bounds on the expected number of linear regions, results on the expected curve length distortion, and results on the neural tangent kernel (NTK).
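To make the initialization idea concrete, the following is a minimal sketch, assuming a fully-connected maxout layer with rank K whose weights are drawn i.i.d. Gaussian with variance c / fan_in. The constant `c` and the helper names (`init_maxout_layer`, `maxout_forward`) are illustrative placeholders, not the values or API of the paper; the paper derives the appropriate scaling from the moments of the gradients.

```python
# Hypothetical sketch of maxout-rank-aware, fan-in-scaled Gaussian initialization.
# The constant `c` below is a placeholder; the paper determines its value from
# the moments of the input-output Jacobian for a given maxout rank K.
import numpy as np

rng = np.random.default_rng(0)

def init_maxout_layer(fan_in, fan_out, K, c=0.55):
    # Weights for the K affine pre-activations per output unit, each ~ N(0, c / fan_in).
    std = np.sqrt(c / fan_in)
    W = rng.normal(0.0, std, size=(K, fan_out, fan_in))
    b = np.zeros((K, fan_out))
    return W, b

def maxout_forward(x, W, b):
    # Pre-activations have shape (K, fan_out); a maxout unit returns the
    # coordinate-wise maximum over its K affine maps.
    pre = np.einsum("kof,f->ko", W, x) + b
    return pre.max(axis=0)

# Usage: one hidden layer of width 128 with maxout rank K = 5.
W, b = init_maxout_layer(fan_in=64, fan_out=128, K=5)
x = rng.normal(size=64)
y = maxout_forward(x, W, b)
```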