Self-training is a simple yet effective method within semi-supervised learning. The idea is to iteratively enhance the training data by adding pseudo-labeled data. Its generalization performance heavily depends on the selection of these pseudo-labeled data (pseudo-label selection, PLS). In this paper, we aim to render PLS more robust to the involved modeling assumptions. To this end, we propose to select pseudo-labeled data that maximize a multi-objective utility function. The latter is constructed to account for different sources of uncertainty, three of which we discuss in more detail: model selection, accumulation of errors, and covariate shift. In the absence of second-order information on such uncertainties, we further consider the generic approach of the generalized Bayesian alpha-cut updating rule for credal sets. As a practical proof of concept, we spotlight the application of three of our robust extensions to simulated and real-world data. Results suggest that robustness with respect to model choice, in particular, can lead to substantial accuracy gains.
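To make the core idea concrete, the following is a minimal, hypothetical sketch of self-training with utility-based PLS. The specific objectives (prediction confidence as a guard against error accumulation, and proximity to the labeled data as a crude proxy for covariate shift), their weighting, and all function names are illustrative assumptions, not the paper's actual utility function.

```python
# Hypothetical sketch: self-training with multi-objective pseudo-label
# selection. Objectives and their aggregation are placeholder choices.
import numpy as np
from sklearn.linear_model import LogisticRegression


def multi_objective_utility(probs, x_unlabeled, x_labeled):
    """Combine per-instance scores into one utility (higher = safer to add).

    probs: (n_u, k) predicted class probabilities for the unlabeled points.
    """
    # Confidence score: guards against accumulating pseudo-label errors.
    confidence = probs.max(axis=1)
    # Distance to the nearest labeled point: crude covariate-shift proxy.
    dists = np.linalg.norm(
        x_unlabeled[:, None, :] - x_labeled[None, :, :], axis=-1
    ).min(axis=1)
    proximity = 1.0 / (1.0 + dists)
    # Placeholder aggregation: equal-weight sum of the two objectives.
    return 0.5 * confidence + 0.5 * proximity


def self_train(x_l, y_l, x_u, n_iter=10, batch=5):
    """Iteratively pseudo-label the highest-utility unlabeled points."""
    for _ in range(n_iter):
        if len(x_u) == 0:
            break
        model = LogisticRegression(max_iter=1000).fit(x_l, y_l)
        probs = model.predict_proba(x_u)
        utility = multi_objective_utility(probs, x_u, x_l)
        top = np.argsort(utility)[-batch:]  # best candidates this round
        pseudo_labels = model.classes_[probs[top].argmax(axis=1)]
        x_l = np.vstack([x_l, x_u[top]])
        y_l = np.concatenate([y_l, pseudo_labels])
        x_u = np.delete(x_u, top, axis=0)
    return LogisticRegression(max_iter=1000).fit(x_l, y_l)
```

Replacing the confidence-only criterion of standard self-training with such a composite utility is what allows the selection step to trade off several sources of uncertainty at once.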