We present a methodology for using unlabeled data to design semi-supervised learning (SSL) methods that improve the predictive performance of supervised learning for regression tasks. The main idea is to design different mechanisms for integrating the unlabeled data, each including a mixing parameter $\alpha$ that controls the weight given to the unlabeled data. Focusing on Generalized Linear Models (GLMs), we analyze the characteristics of different mixing mechanisms and prove that, in all cases, integrating the unlabeled data with some non-zero mixing ratio $\alpha>0$ is always beneficial in terms of predictive performance. Moreover, we provide a rigorous framework for estimating the mixing ratio $\alpha^*$ at which the mixed-SSL estimator achieves the best predictive performance, using only the labeled and unlabeled data at hand. The effectiveness of our methodology in delivering substantial improvement over standard supervised models, under a variety of settings, is demonstrated empirically through extensive simulation, in a manner that supports the theoretical analysis. We also demonstrate the applicability of our methodology (with some intuitive modifications) to improving more complex models, such as deep neural networks, in real-world regression tasks.
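To make the mixing idea concrete, the sketch below shows one *hypothetical* mixing mechanism for ordinary least squares (not necessarily one of the paper's mechanisms): the labeled second-moment matrix is blended with its unlabeled counterpart using the weight $\alpha$, so that $\alpha=0$ recovers the purely supervised estimator. Variable names and the specific mechanism are illustrative assumptions.

```python
import numpy as np

def mixed_ssl_ols(X_lab, y_lab, X_unl, alpha):
    """Mixed-SSL linear regression sketch (illustrative, not the paper's estimator).

    The Gram matrix X^T X / n is estimated as a convex combination of the
    labeled and unlabeled second-moment matrices, weighted by alpha.
    alpha = 0 reduces to ordinary least squares on the labeled data.
    """
    S_lab = X_lab.T @ X_lab / len(X_lab)   # labeled second-moment estimate
    S_unl = X_unl.T @ X_unl / len(X_unl)   # unlabeled second-moment estimate
    S_mix = (1.0 - alpha) * S_lab + alpha * S_unl
    return np.linalg.solve(S_mix, X_lab.T @ y_lab / len(X_lab))

# Small synthetic example: few labeled points, many unlabeled covariates.
rng = np.random.default_rng(0)
p, n_lab, n_unl = 5, 30, 2000
beta_true = rng.normal(size=p)
X_lab = rng.normal(size=(n_lab, p))
y_lab = X_lab @ beta_true + rng.normal(scale=1.0, size=n_lab)
X_unl = rng.normal(size=(n_unl, p))

beta_sup = mixed_ssl_ols(X_lab, y_lab, X_unl, alpha=0.0)  # plain OLS
beta_mix = mixed_ssl_ols(X_lab, y_lab, X_unl, alpha=0.5)  # mixed estimator
```

In practice one would pick $\alpha^*$ by scanning a grid of $\alpha$ values and estimating the prediction risk of each (e.g. by cross-validation on the labeled sample), which mirrors the estimation framework described above.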