Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms

Understanding generalization in deep learning has been one of the major challenges in statistical learning theory over the last decade. While recent work has illustrated that the dataset and the training algorithm must be taken into account in order to obtain meaningful generalization bounds, it is still not clear theoretically which properties of the data and the algorithm determine the generalization performance. In this study, we approach this problem from a dynamical systems theory perspective and represent stochastic optimization algorithms as random iterated function systems (IFS). Such IFSs are well studied in the dynamical systems literature and, under mild assumptions, can be shown to be ergodic with an invariant measure that is often supported on sets with a fractal structure. As our main contribution, we prove that the generalization error of a stochastic optimization algorithm can be bounded based on the "complexity" of the fractal structure that underlies its invariant measure. Leveraging results from dynamical systems theory, we show that the generalization error can be explicitly linked to the choice of the algorithm (e.g., stochastic gradient descent, SGD), algorithm hyperparameters (e.g., step-size, batch-size), and the geometry of the problem (e.g., Hessian of the loss). We further specialize our results to specific problems (e.g., linear/logistic regression, neural networks with one hidden layer) and algorithms (e.g., SGD and preconditioned variants), and obtain analytical estimates for our bound. For modern neural networks, we develop an efficient algorithm to compute the bound and support our theory with various experiments on neural networks.
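To make the IFS view concrete, the following is a minimal sketch rather than the construction used in the paper. It assumes a one-dimensional least-squares loss 0.5 * (x*w - y)^2, for which a single SGD step on data point (x, y) is the affine map w -> (1 - step_size * x^2) * w + step_size * x * y. Picking a data point uniformly at random at each step then gives a random IFS. The toy data, the step size, and the helper names `sgd_maps`, `run_ifs`, and `similarity_dimension` are all illustrative assumptions.

```python
import math
import random


def sgd_maps(data, step_size):
    """Return the affine maps w -> a*w + b induced by one SGD step per data point.

    For the loss 0.5 * (x*w - y)**2 the SGD update w - step_size * x * (x*w - y)
    equals (1 - step_size * x**2) * w + step_size * x * y.
    """
    return [(1.0 - step_size * x * x, step_size * x * y) for x, y in data]


def run_ifs(maps, w0, n_steps, rng):
    """Iterate the random IFS: at each step apply one map chosen uniformly."""
    w = w0
    trajectory = []
    for _ in range(n_steps):
        a, b = rng.choice(maps)
        w = a * w + b
        trajectory.append(w)
    return trajectory


def similarity_dimension(maps):
    """Similarity dimension log(N) / log(1/r) for N maps with a shared ratio r.

    This shortcut only applies when all maps contract by the same ratio.
    """
    ratios = {round(abs(a), 12) for a, _ in maps}
    if len(ratios) != 1:
        raise ValueError("this shortcut assumes equal contraction ratios")
    r = ratios.pop()
    if not 0.0 < r < 1.0:
        raise ValueError("maps must be strict contractions")
    return math.log(len(maps)) / math.log(1.0 / r)


# Toy data (hypothetical): two points with x = 1 and targets 0 and 1.
data = [(1.0, 0.0), (1.0, 1.0)]
maps = sgd_maps(data, step_size=2.0 / 3.0)
trajectory = run_ifs(maps, w0=0.5, n_steps=2000, rng=random.Random(0))
dim = similarity_dimension(maps)
```

With this particular step size the two maps are w -> w/3 and w -> w/3 + 2/3, the classical generators of the middle-thirds Cantor set, so the iterates never land in the open middle third and `dim` equals log 2 / log 3 (about 0.63). Changing `step_size` changes the contraction ratio and hence the dimension, which is a toy version of how hyperparameters can shape the fractal structure discussed above.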
