Machine learning models that are developed with invariance to certain types of data transformations have demonstrated superior generalization performance in practice. However, the underlying mechanism that explains why invariance leads to better generalization is not well understood, limiting our ability to select appropriate data transformations for a given dataset. This paper studies the generalization benefit of model invariance by introducing the sample cover induced by transformations, i.e., a representative subset of a dataset that can approximately recover the whole dataset using transformations. Based on this notion, we refine the generalization bound for invariant models and characterize the suitability of a set of data transformations by the sample covering number induced by transformations, i.e., the smallest size of its induced sample covers. We show that the generalization bound can be tightened for suitable transformations that have a small sample covering number. Moreover, our proposed sample covering number can be empirically evaluated, providing a practical guide for selecting transformations to develop model invariance for better generalization. We evaluate the sample covering numbers for commonly used transformations on multiple datasets and demonstrate that a smaller sample covering number for a set of transformations indicates a smaller gap between the test and training error for invariant models, thus validating our propositions.
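To make the notion of a sample cover concrete, here is a minimal sketch in Python. It is not the paper's algorithm. It assumes a finite list of transformations, Euclidean distance, a tolerance `eps` for "approximately recover", and a simple one-pass greedy selection. The function name `greedy_sample_cover` and the toy rotations are hypothetical. Because the selection is greedy, the size of the returned cover is only an upper bound on the sample covering number.

```python
from math import dist
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Transform = Callable[[Point], Point]


def greedy_sample_cover(
    data: Sequence[Point], transforms: Sequence[Transform], eps: float
) -> List[int]:
    """Return indices of a sample cover found by one greedy pass.

    Assumption: a point x counts as covered when it lies within `eps`
    (Euclidean distance) of a representative or of one of its transformed
    copies. The identity is always included implicitly.
    """
    cover: List[int] = []
    orbits: List[List[Point]] = []  # representative plus its transformed copies
    for i, x in enumerate(data):
        covered = any(dist(x, y) <= eps for orbit in orbits for y in orbit)
        if not covered:
            # x is not recoverable from the current cover, so keep it.
            cover.append(i)
            orbits.append([tuple(x)] + [tuple(t(x)) for t in transforms])
    return cover


# Hypothetical toy transformations: planar rotations by 90, 180, 270 degrees.
rotations: List[Transform] = [
    lambda p: (-p[1], p[0]),
    lambda p: (-p[0], -p[1]),
    lambda p: (p[1], -p[0]),
]

points: List[Point] = [(1, 0), (0, 1), (-1, 0), (0, -1), (2, 0), (0, 2)]

print(greedy_sample_cover(points, rotations, eps=0.1))  # [0, 4]
print(len(greedy_sample_cover(points, [], eps=0.1)))    # 6
```

On this toy dataset the rotations shrink the cover from six points to two, which is the kind of reduction the abstract associates with suitable transformations. An exact sample covering number would require a minimum set cover, which this sketch does not attempt.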