The goal of a learning algorithm is to receive a training data set as input and to provide a hypothesis that generalizes to all possible data points from a domain set. The hypothesis is chosen from hypothesis classes of potentially different complexities. Linear regression modeling is an important category of learning algorithms. In practice, uncertainty in the target samples affects the generalization performance of the learned model, and failing to choose a proper model or hypothesis class can lead to serious issues such as underfitting or overfitting. These issues have been addressed by altering the cost function or by utilizing cross-validation methods; such approaches can introduce new hyperparameters, with their own challenges and uncertainties, or increase the computational complexity of the learning algorithm. On the other hand, the theory of probably approximately correct (PAC) learning aims at defining learnability in a probabilistic setting. Despite its theoretical value, PAC theory often fails to address practical learning issues. This work is inspired by the foundations of PAC learning and motivated by these existing issues in regression learning. The proposed approach, denoted epsilon-Confidence Approximately Correct (epsilon CoAC), utilizes the Kullback-Leibler divergence (relative entropy) and introduces a related typical set in the set of hyperparameters to tackle the learnability issue. Moreover, it enables the learner to compare hypothesis classes of different complexity orders and to choose among them the optimum, namely the one with the minimum epsilon in the epsilon CoAC framework. Epsilon CoAC learnability not only overcomes the issues of overfitting and underfitting, but also outperforms the well-known cross-validation method in both time consumption and accuracy.
Title: Learnability, Sample Complexity, and Hypothesis Class Complexity for Regression Models
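Since the framework above is built on the Kullback-Leibler divergence (relative entropy), a minimal sketch of that quantity for discrete distributions may be helpful. This is a generic NumPy illustration of the divergence itself, not the epsilon CoAC procedure or its typical set construction, which are defined in the paper:

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) between two discrete
    probability distributions, given as arrays of probabilities."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Only terms with p > 0 contribute to the sum; the divergence is
    # infinite if q assigns zero mass where p has positive mass.
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# D(p || p) = 0, and D is asymmetric in general:
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))  # 0.0
print(kl_divergence(p, q) != kl_divergence(q, p))  # True
```

Note that the divergence is nonnegative and equals zero only when the two distributions coincide, which is what makes it usable as a discrepancy measure between a hypothesis and the data-generating distribution.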