Semi-supervised learning by self-training relies heavily on pseudo-label selection (PLS). The selection often depends on the initial model fitted to the labeled data. Early overfitting can thus propagate to the final model through the selection of instances with overconfident but erroneous predictions, a phenomenon often referred to as confirmation bias. This paper introduces BPLS, a Bayesian framework for PLS that aims to mitigate this issue. At its core lies a criterion for selecting instances to pseudo-label: an analytical approximation of the posterior predictive of pseudo-samples. We derive this selection criterion by proving the Bayes optimality of the posterior predictive of pseudo-samples, and we overcome the computational hurdles it poses by approximating it analytically. Its relation to the marginal likelihood allows us to derive an approximation based on Laplace's method and the Gaussian integral. We empirically assess BPLS for parametric generalized linear models and non-parametric generalized additive models on simulated and real-world data. When faced with high-dimensional data prone to overfitting, BPLS outperforms traditional PLS methods.
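To make the selection mechanism concrete, the following is a minimal sketch (not the paper's BPLS implementation) of the general idea: score each candidate pseudo-labeled instance by a Laplace approximation of the log marginal likelihood of the augmented data, here for a logistic regression model with a Gaussian prior. The helper names (`laplace_log_evidence`, `select_pseudo_label`) and the choice of prior scale `tau` are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def log_joint(theta, X, y, tau=1.0):
    # Bernoulli log-likelihood plus log of a Gaussian prior N(0, tau^2 I).
    z = X @ theta
    ll = np.sum(y * z - np.logaddexp(0.0, z))
    lp = -0.5 * theta @ theta / tau**2
    return ll + lp

def map_and_hessian(X, y, tau=1.0, iters=50):
    # Newton's method for the MAP estimate; returns MAP and Hessian of log joint.
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (y - p) - theta / tau**2
        W = p * (1.0 - p)
        H = -(X.T * W) @ X - np.eye(d) / tau**2  # negative definite at the MAP
        theta = theta - np.linalg.solve(H, grad)
    return theta, H

def laplace_log_evidence(X, y, tau=1.0):
    # Laplace's method: log p(D) ≈ log p(D, θ*) + (d/2)·log(2π) − (1/2)·log|−H|,
    # i.e. a Gaussian integral around the MAP θ*.
    theta, H = map_and_hessian(X, y, tau)
    d = X.shape[1]
    _, logdet = np.linalg.slogdet(-H)
    return log_joint(theta, X, y, tau) + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet

def select_pseudo_label(X_lab, y_lab, X_unlab, tau=1.0):
    # Score each unlabeled point by the approximate evidence of the
    # data set augmented with that point and its predicted pseudo-label.
    theta, _ = map_and_hessian(X_lab, y_lab, tau)
    scores = []
    for x in X_unlab:
        y_hat = float(1.0 / (1.0 + np.exp(-x @ theta)) > 0.5)  # pseudo-label
        X_aug = np.vstack([X_lab, x])
        y_aug = np.append(y_lab, y_hat)
        scores.append(laplace_log_evidence(X_aug, y_aug, tau))
    return int(np.argmax(scores)), scores
```

Ranking candidates by evidence rather than by raw predicted probability is what lets the criterion penalize overconfident pseudo-labels that fit the labeled data well but are poorly supported under the posterior.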