Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions, such as the mutual information between the data sample and the algorithm output, the compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, the complexity notions they suggest may appear unrelated, which restricts their high-level impact. In this study, we prove novel generalization bounds through the lens of rate-distortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the `compression error rate' can be linked to the generalization error both in expectation and with high probability. We show that in the `lossless compression' setting we recover and improve existing mutual information-based bounds, whereas a `lossy compression' scheme allows us to link generalization to the rate-distortion dimension -- a particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions.
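For concreteness, the two regimes can be read against standard source-coding quantities; the display below is an illustrative sketch in generic notation ($W$ for the algorithm output, $S$ for the $n$-point training sample, $d$ for a distortion measure), not the paper's exact statements. It recalls the rate-distortion function and the classical mutual-information bound of Xu and Raginsky (2017), which the lossless setting recovers and improves:
\[
R(D) \;=\; \inf_{P_{\hat W \mid W}\,:\,\mathbb{E}[d(W,\hat W)] \le D} I(W;\hat W),
\qquad
\bigl|\mathbb{E}[\operatorname{gen}(S,W)]\bigr| \;\le\; \sqrt{\frac{2\sigma^{2}\, I(S;W)}{n}},
\]
where the second inequality assumes a $\sigma$-subgaussian loss. In the lossy regime ($D > 0$), the relevant quantity is instead the growth rate of $R(D)$ as $D \to 0$, which, up to normalization (roughly $R(D)/\log(1/D)$), gives the rate-distortion dimension of Kawabata and Dembo.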