Recently, information-theoretic analysis has become a popular framework for understanding the generalization behavior of deep neural networks. It allows a direct analysis of stochastic gradient/Langevin descent (SGD/SGLD) learning algorithms without strong assumptions such as Lipschitz continuity or convexity. However, the current generalization error bounds within this framework are still far from optimal, and substantially improving them is challenging due to the intractability of high-dimensional information quantities. To address this issue, we first propose a novel information-theoretic measure, kernelized Renyi's entropy, by utilizing operator representations in Hilbert space. It inherits the properties of Shannon's entropy and can be effectively calculated via simple random sampling, while remaining independent of the input dimension. We then establish generalization error bounds for SGD/SGLD under kernelized Renyi's entropy, in which the mutual information quantities can be computed directly, enabling an evaluation of the tightness of each intermediate step. We show that our information-theoretic bounds depend on the statistics of the stochastic gradients evaluated along the trajectory of iterates, and are rigorously tighter than the current state-of-the-art (SOTA) results. The theoretical findings are further supported by large-scale empirical studies.
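The exact operator-based construction of kernelized Renyi's entropy is given in the paper body; as a rough illustration of the computational idea only, the sketch below estimates the closely related matrix-based Renyi alpha-entropy from a random sample using a normalized Gram matrix, whose cost scales with the number of samples rather than the input dimension. The Gaussian kernel, the bandwidth `sigma`, and the choice `alpha=2` are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np


def rbf_gram(X, sigma=1.0):
    """Gram matrix of the Gaussian (RBF) kernel over the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))


def matrix_renyi_entropy(X, alpha=2.0, sigma=1.0):
    """Matrix-based Renyi alpha-entropy estimated from n sampled points.

    Builds a trace-one normalized Gram matrix A and returns
    (1 / (1 - alpha)) * log2(sum_i lambda_i(A) ** alpha),
    where lambda_i are the eigenvalues of A. Only the sample size n
    enters the computation, not the ambient input dimension.
    """
    K = rbf_gram(X, sigma)
    d = np.sqrt(np.diag(K))
    # A_ij = K_ij / (n * sqrt(K_ii * K_jj)), so that tr(A) = 1.
    A = K / np.outer(d, d) / K.shape[0]
    eigvals = np.linalg.eigvalsh(A)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    return np.log2(np.sum(eigvals ** alpha)) / (1.0 - alpha)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 1000))  # n = 200 samples drawn from a 1000-d space
    print(matrix_renyi_entropy(X, alpha=2.0, sigma=10.0))
```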