We present a general methodology for using unlabeled data to design semi-supervised learning (SSL) variants of the Empirical Risk Minimization (ERM) learning process. Focusing on generalized linear regression, we analyze the effectiveness of our SSL approach in improving prediction performance. The key ideas are to treat the null model as a competitor, and to use the unlabeled data to determine the signal-noise combinations in which SSL outperforms both supervised learning and the null model. We then apply SSL adaptively, based on estimates of the signal and noise. In the special case of linear regression with Gaussian covariates, we prove that the non-adaptive SSL version cannot improve on both the supervised estimator and the null model simultaneously, beyond a negligible O(1/n) term. In contrast, the adaptive model presented in this work can achieve a substantial improvement over both competitors simultaneously, under a variety of settings. We show this empirically through extensive simulations, and extend the results to other scenarios, such as non-Gaussian covariates, misspecified linear regression, and generalized linear regression with non-linear link functions.