A notable direction of recent progress on stochastic deep-learning algorithms has been the discovery of a rather mysterious heavy-tailed nature of the stationary distribution of these algorithms, even when the input data are not heavy-tailed. Moreover, the heavy-tail index is known to show an interesting dependence on the input dimension of the net, the mini-batch size, and the step size of the algorithm. In this short note, we undertake an experimental study of this index for S.G.D. while training a $\relu$ gate (in the realizable and in the binary classification setup) and for a variant of S.G.D. that was proven in Karmakar and Mukherjee (2022) to converge on ReLU realizable data. From our experiments we conjecture that these two algorithms have similar heavy-tail behaviour on any data where the latter can be proven to converge. Secondly, we demonstrate that the heavy-tail index of the late-time iterates in this model scenario has strikingly different properties from either what has been proven for linear hypothesis classes or what has been previously demonstrated for large nets.
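The following is a minimal sketch (not the authors' code) of the kind of experiment described above: S.G.D. on a single ReLU gate with realizable Gaussian data, followed by a tail-index estimate over the late-time iterates. The data model, step size, batch size, the choice of $\|w_t - w_*\|$ as the measured statistic, and the use of a Hill estimator are all illustrative assumptions; the paper's own estimator and hyperparameters may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, batch, n_steps, burn_in = 10, 0.05, 8, 20000, 10000

w_star = rng.normal(size=d)   # ground-truth gate (realizable setup)
w = rng.normal(size=d)        # S.G.D. iterate

tail_samples = []
for t in range(n_steps):
    x = rng.normal(size=(batch, d))              # fresh Gaussian mini-batch
    y = np.maximum(x @ w_star, 0.0)              # labels from the true ReLU gate
    pred = x @ w
    # gradient of the 1/2 squared loss for a single ReLU gate
    grad = ((np.maximum(pred, 0.0) - y) * (pred > 0)) @ x / batch
    w = w - eta * grad
    if t >= burn_in:                             # keep only late-time iterates
        tail_samples.append(np.linalg.norm(w - w_star))

# Hill estimator of the tail index from the k largest order statistics
# (one common choice; the paper's estimator may differ).
s = np.sort(np.asarray(tail_samples))
k = max(10, len(s) // 100)
alpha_hat = 1.0 / np.mean(np.log(s[-k:] / s[-k - 1]))
print(f"estimated heavy-tail index: {alpha_hat:.2f}")
```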