In this paper, we establish novel data-dependent upper bounds on the generalization error through the lens of a "variable-size compressibility" framework that we introduce here. In this framework, the generalization error of an algorithm is linked to a variable-size "compression rate" of its input data. This yields bounds that depend on the empirical measure of the given input data, rather than on its unknown distribution. The new generalization bounds we establish are tail bounds, tail bounds on the expectation, and in-expectation bounds. Moreover, our framework also allows one to derive general bounds on any function of the input data and output hypothesis random variables. In particular, these general bounds are shown to subsume, and possibly improve over, several existing PAC-Bayes and data-dependent intrinsic dimension-based bounds, which are recovered as special cases, thus unveiling the unifying character of our approach. For instance, a new data-dependent intrinsic dimension-based bound is established, which connects the generalization error to the optimization trajectories and reveals various interesting connections with the rate-distortion dimension of a process, the Rényi information dimension of a process, and the metric mean dimension.