The problem of vanishing and exploding gradients has been a long-standing obstacle to the effective training of neural networks. Although various tricks and techniques have been employed to alleviate the problem in practice, satisfactory theories and provable solutions are still lacking. In this paper, we address the problem from the perspective of high-dimensional probability theory. We provide a rigorous result showing, under mild conditions, that the vanishing/exploding gradients problem disappears with high probability if the neural network is sufficiently wide. Our main idea is to constrain both forward and backward signal propagation in a nonlinear neural network through a new class of activation functions, namely Gaussian-Poincar\'e normalized functions, together with orthogonal weight matrices. Experiments on both synthetic and real-world data validate our theory and confirm the effectiveness of our approach on very deep neural networks in practice.
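For concreteness, the following is a minimal numerical sketch of the two ingredients named above: an activation rescaled so that both its value and its derivative have unit second moment under a standard Gaussian input, and a random orthogonal weight matrix. The affine form $a\,\phi(x)+b$, the Monte Carlo estimation, and the helper names gp_normalize and random_orthogonal are illustrative assumptions; the paper's exact construction and constants may differ.

```python
import numpy as np

def gp_normalize(phi, dphi, n_samples=2_000_000, seed=0):
    """Find (a, b) such that phi_hat(x) = a * phi(x) + b satisfies, for Z ~ N(0,1),
    E[phi_hat(Z)^2] ~= 1 and E[phi_hat'(Z)^2] ~= 1 (Monte Carlo estimate).
    The Gaussian-Poincare inequality Var(phi(Z)) <= E[phi'(Z)^2] guarantees the
    square root below is real, so such an affine rescaling exists for smooth phi."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)
    p = phi(z)
    e_phi, var_phi = p.mean(), p.var()          # E[phi(Z)], Var(phi(Z))
    e_dphi2 = (dphi(z) ** 2).mean()             # E[phi'(Z)^2]

    a = 1.0 / np.sqrt(e_dphi2)                          # enforces E[phi_hat'(Z)^2] = 1
    b = -a * e_phi + np.sqrt(1.0 - a ** 2 * var_phi)    # enforces E[phi_hat(Z)^2]  = 1
    return a, b

def random_orthogonal(n, seed=0):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix,
    with a sign correction on the diagonal of R."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

# Example: Gaussian-Poincare normalize tanh and check both moments.
a, b = gp_normalize(np.tanh, lambda x: 1.0 / np.cosh(x) ** 2)
z = np.random.default_rng(1).standard_normal(1_000_000)
print(np.mean((a * np.tanh(z) + b) ** 2))        # ~1.0  (forward signal preserved)
print(np.mean((a / np.cosh(z) ** 2) ** 2))       # ~1.0  (backward signal preserved)
```

In this sketch, keeping both moments at one is what constrains the forward and backward signal magnitudes layer by layer, while the orthogonal weight matrices preserve norms exactly between layers.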