The ultimate goal of a supervised learning algorithm is to produce models, constructed on the training data, that generalize well to new examples. In classification, functional margin maximization -- correctly classifying as many training examples as possible with maximal confidence -- is known to construct models with good generalization guarantees. This work gives an information-theoretic interpretation of a margin-maximizing model on a noiseless training dataset as one that achieves lossless maximal compression of that dataset -- i.e., it extracts from the features all the information useful for predicting the label, and no more. This connection offers new insight into generalization in supervised machine learning, casting margin maximization as a special case (that of classification) of a more general principle, and explains the success and potential limitations of popular learning algorithms like gradient boosting. We support our observations with theoretical arguments and empirical evidence and identify interesting directions for future work.