Since the recognition in the early nineties of the vanishing/exploding (V/E) gradient issue plaguing the training of neural networks (NNs), significant efforts have been exerted to overcome this obstacle. However, a clear solution to the V/E issue has so far remained elusive. In this manuscript a new NN architecture is proposed, designed to mathematically prevent the V/E issue from occurring. The pursuit of approximate dynamical isometry, i.e. parameter configurations where the singular values of the input-output Jacobian are tightly distributed around 1, leads to the derivation of a NN architecture that shares common traits with the popular Residual Network model. Instead of skip connections between layers, the idea is to filter the previous activations orthogonally and add them to the nonlinear activations of the next layer, realising a convex combination of the two. Remarkably, the impossibility for the gradient updates to either vanish or explode is demonstrated with analytical bounds that hold even in the infinite-depth case. The effectiveness of this method is empirically demonstrated by training, via backpropagation, an extremely deep multilayer perceptron of 50k layers, and an Elman NN that learns long-term dependencies lying 10k time steps in the past. Compared with other architectures specifically devised to deal with the V/E problem, e.g. LSTMs for recurrent NNs, the proposed model is considerably simpler yet more effective. Surprisingly, a single-layer vanilla RNN can be enhanced to reach state-of-the-art performance while converging remarkably fast; on the psMNIST task, for instance, test accuracy exceeds 94% within the first epoch and 98% after just 10 epochs.
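As a rough illustration of the layer described above, the following is a minimal NumPy sketch of one possible reading of the update rule: the previous activation is passed through an orthogonal matrix and convexly combined with the new nonlinear activation. The mixing coefficient alpha, the tanh nonlinearity, the 1/sqrt(n) weight initialisation, and all variable names are assumptions made for illustration, not the paper's exact formulation; the analytical bounds mentioned in the abstract rely on the paper's own construction, which this sketch does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth, alpha = 32, 200, 0.5   # width, depth, convex mixing coefficient (assumed)

def orthogonal(n):
    # Random orthogonal matrix via QR decomposition.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def layer(h, Q, W, b):
    # Convex combination of the orthogonally filtered previous activation
    # and the nonlinear activation of the current layer (tanh assumed).
    return (1 - alpha) * Q @ h + alpha * np.tanh(W @ h + b)

def layer_jacobian(h, Q, W, b):
    # Jacobian of one layer w.r.t. its input: (1-alpha) Q + alpha diag(tanh') W.
    d = 1.0 - np.tanh(W @ h + b) ** 2
    return (1 - alpha) * Q + alpha * (d[:, None] * W)

params = [(orthogonal(n), rng.standard_normal((n, n)) / np.sqrt(n), np.zeros(n))
          for _ in range(depth)]

# Accumulate the input-output Jacobian through the depth via the chain rule
# and inspect the spread of its singular values.
h = rng.standard_normal(n)
J = np.eye(n)
for Q, W, b in params:
    J = layer_jacobian(h, Q, W, b) @ J
    h = layer(h, Q, W, b)

s = np.linalg.svd(J, compute_uv=False)
print(f"depth-{depth} Jacobian singular values: min={s.min():.3f}, max={s.max():.3f}")
```

Under this reading, the orthogonal branch contributes a factor with all singular values equal to (1 - alpha), which is what intuitively keeps the product of layer Jacobians from collapsing or blowing up; the precise statement and bounds are those of the manuscript, not of this sketch.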