Over-parameterized models can perfectly fit a wide variety of data distributions; however, generalization error is usually lower for real data than for artificial data. This suggests that the properties of the data distribution affect generalization capability. This work focuses on the search space defined by the input data and assumes that the correlation between the labels of neighboring input values influences generalization. If this correlation is low, the randomness of the input data space is high, leading to high generalization error. We propose measuring the randomness of an input data space with Maurer's universal statistical test. Results for synthetic classification tasks and common image classification benchmarks (MNIST, CIFAR10, and Microsoft's cats vs. dogs data set) show a strong correlation between the randomness of the input data space and the generalization error of deep neural networks on binary classification problems.
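To make the proposed measure concrete, the following is a minimal sketch of Maurer's universal statistical test as defined in the standard formulation (NIST SP 800-22 style): the bit sequence is split into L-bit blocks, the first Q blocks initialize a last-occurrence table, and the statistic fn averages the base-2 logarithm of the distances between repeated blocks over the remaining K test blocks. Applying it to a binary label sequence obtained by ordering the input space is an assumption for illustration, not the paper's exact pipeline; the parameter choices (L=7, Q=10*2^L) are common defaults, not values taken from the source.

```python
import math
import random


def maurer_universal(bits, L=7, Q=None):
    """Maurer's universal statistical test statistic fn.

    bits: sequence of 0/1 ints. For truly random bits fn approaches a
    known expected value (about 6.196 for L=7); highly regular
    sequences yield much smaller values.
    """
    if Q is None:
        Q = 10 * (1 << L)            # common choice: Q = 10 * 2^L init blocks
    n_blocks = len(bits) // L
    K = n_blocks - Q                 # number of test blocks
    if K <= 0:
        raise ValueError("sequence too short for chosen L and Q")

    # last occurrence (1-based block index) of each L-bit pattern
    table = [0] * (1 << L)

    def block(i):                    # decode the i-th (1-based) L-bit block
        v = 0
        for bit in bits[(i - 1) * L : i * L]:
            v = (v << 1) | bit
        return v

    for i in range(1, Q + 1):        # initialization segment
        table[block(i)] = i

    total = 0.0
    for i in range(Q + 1, Q + K + 1):  # test segment
        p = block(i)
        total += math.log2(i - table[p])  # distance to last occurrence
        table[p] = i
    return total / K
```

A random label sequence scores near the theoretical expectation (~6.196 for L=7), while a constant sequence scores 0, matching the abstract's intuition that low neighbor correlation (high randomness) should coincide with high generalization error.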