Despite the ubiquitous use of stochastic optimization algorithms in machine learning, the precise impact of these algorithms and their dynamics on generalization performance in realistic non-convex settings is still poorly understood. While recent work has revealed connections between generalization and heavy-tailed behavior in stochastic optimization, this work mainly relied on continuous-time approximations, and a rigorous treatment of the original discrete-time iterations has yet to be performed. To bridge this gap, we present novel bounds linking generalization to the lower tail exponent of the transition kernel associated with the optimizer around a local minimum, in both discrete- and continuous-time settings. To achieve this, we first prove a data- and algorithm-dependent generalization bound in terms of the celebrated Fernique-Talagrand functional applied to the trajectory of the optimizer. We then specialize this result by exploiting the Markovian structure of stochastic optimizers, and derive bounds in terms of their (data-dependent) transition kernels. We support our theory with empirical results on a variety of neural networks, showing correlations between generalization error and lower tail exponents.
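For context, a standard textbook form of the Fernique-Talagrand (or $\gamma_2$) functional referenced above is given below; this is the classical definition, and the precise variant and constants used in the bound itself may differ. For a metric space $(T, d)$,
\[
\gamma_2(T, d) \;=\; \inf_{(T_n)_{n \ge 0}} \; \sup_{t \in T} \; \sum_{n \ge 0} 2^{n/2} \, d(t, T_n),
\]
where the infimum ranges over all admissible sequences of subsets $T_n \subseteq T$ satisfying $|T_0| = 1$ and $|T_n| \le 2^{2^n}$, and $d(t, T_n)$ denotes the distance from $t$ to the nearest point of $T_n$. The classical Fernique-Talagrand (majorizing measures) theorem states that for a centered Gaussian process $(X_t)_{t \in T}$ with canonical metric $d$, the quantity $\mathbb{E} \sup_{t \in T} X_t$ is comparable to $\gamma_2(T, d)$ up to universal constants, which is what makes this functional a natural complexity measure for the optimizer's trajectory.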