Stochastic gradients are closely related to both the optimization and the generalization of deep neural networks (DNNs). Some works have attempted to explain the success of stochastic optimization for deep learning by the purportedly heavy-tailed properties of gradient noise, while other works have presented theoretical and empirical evidence against the heavy-tail hypothesis for gradient noise. Unfortunately, formal statistical tests for analyzing the structure and heavy tails of stochastic gradients in deep learning remain under-explored. In this paper, we make two main contributions. First, we conduct formal statistical tests on the distribution of stochastic gradients and gradient noise across both parameters and iterations. Our statistical tests reveal that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and the stochastic gradient noise caused by minibatch training usually do not. Second, we further discover that the covariance spectra of stochastic gradients in deep learning exhibit power-law structure. While previous papers held that the anisotropic structure of stochastic gradients matters to deep learning, they did not anticipate that the gradient covariance could have such an elegant mathematical structure. Our work challenges this existing belief and provides novel insights into the structure of stochastic gradients in deep learning.
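To make the two claims concrete, below is a minimal sketch, not the authors' actual test suite, of the kind of analysis the abstract describes. It assumes NumPy and the third-party `powerlaw` package (which implements the Clauset-Shalizi-Newman methodology for fitting and comparing power-law tails); the helper names `power_law_tail_test` and `covariance_spectrum_slope`, as well as the synthetic Pareto "gradient" used for illustration, are hypothetical and stand in for gradients collected from a real model.

```python
# Illustrative sketch only: power-law tail test on dimension-wise gradient
# magnitudes, and a log-log slope estimate for the gradient covariance spectrum.
import numpy as np
import powerlaw  # pip install powerlaw


def power_law_tail_test(grad_vector: np.ndarray):
    """Fit a power law to |g_i| across parameter dimensions and compare it
    against a lognormal alternative via a likelihood-ratio test."""
    magnitudes = np.abs(grad_vector)
    magnitudes = magnitudes[magnitudes > 0]       # power-law fits need positive support
    fit = powerlaw.Fit(magnitudes)                # estimates xmin and the tail exponent alpha
    R, p = fit.distribution_compare('power_law', 'lognormal')
    return fit.power_law.alpha, fit.power_law.xmin, R, p


def covariance_spectrum_slope(grad_matrix: np.ndarray):
    """grad_matrix has shape (num_minibatches, num_params). Return the
    covariance eigenvalues and the slope of log(lambda_k) vs log(k),
    i.e. a rough check for power-law decay of the spectrum."""
    centered = grad_matrix - grad_matrix.mean(axis=0, keepdims=True)
    svals = np.linalg.svd(centered, compute_uv=False)
    eig = (svals ** 2) / max(grad_matrix.shape[0] - 1, 1)
    eig = eig[eig > 1e-12]
    k = np.arange(1, len(eig) + 1)
    slope, _ = np.polyfit(np.log(k), np.log(eig), 1)
    return eig, slope


if __name__ == "__main__":
    # Toy usage with a synthetic heavy-tailed "gradient" (signed Pareto draws).
    g = (np.random.pareto(a=2.5, size=100_000) + 1.0) * np.random.choice([-1, 1], 100_000)
    alpha, xmin, R, p = power_law_tail_test(g)
    print(f"tail exponent alpha={alpha:.2f}, xmin={xmin:.3g}, LR={R:.2f}, p={p:.3f}")

    G = np.random.randn(200, 500)                 # stand-in for per-minibatch gradients
    _, slope = covariance_spectrum_slope(G)
    print(f"covariance-spectrum log-log slope ~ {slope:.2f}")
```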