We present an information-theoretic framework for understanding overfitting and underfitting in machine learning, and we prove the formal undecidability of determining whether an arbitrary classification algorithm will overfit a dataset. Measuring algorithm capacity via the information transferred from datasets to models, we show that mismatches between algorithm capacity and dataset complexity provide a signature for when a model can overfit or underfit a dataset. We present results upper-bounding algorithm capacity, establish its relationship to quantities in the algorithmic search framework for machine learning, and relate our work to recent information-theoretic approaches to generalization.