Semi-Supervised Empirical Risk Minimization: When can unlabeled data improve prediction?

We present a general methodology for using unlabeled data to design semi-supervised learning (SSL) variants of the Empirical Risk Minimization (ERM) learning process. Focusing on generalized linear regression, we carefully analyze how effectively SSL improves prediction performance. The key ideas are to treat the null model as a competitor and to use the unlabeled data to determine the signal-noise combinations in which SSL outperforms both ERM learning and the null model. In the special case of linear regression with Gaussian covariates, we show that the previously suggested semi-supervised estimator cannot improve on both the supervised estimator and the null model simultaneously. However, the new estimator presented in this work can achieve an improvement of order $O(1/n)$ over both competitors simultaneously. On the other hand, we show that in other scenarios, such as non-Gaussian covariates, misspecified linear regression, or generalized linear regression with non-linear link functions, unlabeled data can yield substantial improvement in practice when our suggested SSL approach is applied. Moreover, the results we establish throughout this work make it possible to identify the situations in which SSL improves prediction. This is shown empirically through extensive simulations.
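The three-way comparison described above (null model, supervised ERM, and an SSL variant) can be illustrated with a small simulation. The sketch below is not the estimator proposed in this work, whose formulas are not given in the abstract. It assumes a simple plug-in variant in which the covariate mean and covariance are estimated from labeled and unlabeled covariates together, while the covariate-response covariance uses the labeled data only. The function names (`fit_null`, `fit_ols`, `fit_ssl`, `simulate`) and all simulation settings are hypothetical.

```python
import numpy as np

def fit_null(y):
    """Null model: ignore the covariates and predict the labeled-sample mean."""
    return float(np.mean(y))

def fit_ols(X, y):
    """Supervised ERM for squared loss: least squares with an intercept."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta

def fit_ssl(X, y, Z):
    """Illustrative plug-in SSL variant (an assumption, not the paper's estimator).

    E[x] and Cov(x) are estimated from all covariates (labeled X and
    unlabeled Z); Cov(x, y) is estimated from the labeled pairs only.
    """
    all_x = np.vstack([X, Z])
    mu = all_x.mean(axis=0)
    centered = all_x - mu
    cov_x = centered.T @ centered / len(all_x)
    cov_xy = (X - mu).T @ (y - y.mean()) / len(y)
    beta = np.linalg.solve(cov_x, cov_xy)
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta

def simulate(n=30, m=3000, p=5, signal=1.0, noise=1.0, n_test=5000, seed=0):
    """Return the test mean squared error of the three competitors on one draw."""
    rng = np.random.default_rng(seed)
    beta_true = signal * np.ones(p) / np.sqrt(p)
    X = rng.standard_normal((n, p))        # labeled covariates
    Z = rng.standard_normal((m, p))        # unlabeled covariates
    Xt = rng.standard_normal((n_test, p))  # test covariates
    y = X @ beta_true + noise * rng.standard_normal(n)
    yt = Xt @ beta_true + noise * rng.standard_normal(n_test)

    def mse(pred):
        return float(np.mean((yt - pred) ** 2))

    a_ols, b_ols = fit_ols(X, y)
    a_ssl, b_ssl = fit_ssl(X, y, Z)
    return {
        "null": mse(fit_null(y)),
        "erm": mse(a_ols + Xt @ b_ols),
        "ssl": mse(a_ssl + Xt @ b_ssl),
    }

if __name__ == "__main__":
    # Vary the signal-to-noise setting to see which competitor has the lowest error.
    for signal in (0.1, 0.5, 1.0, 2.0):
        print(signal, simulate(signal=signal))
```

Sweeping `signal` and `noise` over a grid, and averaging over many seeds, gives a rough empirical picture of which of the three competitors wins in each regime. A single draw, as printed here, is too noisy to support conclusions.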
