This paper provides an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm. This characterization is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset. It can be applied to obtain distribution-free upper and lower bounds on the gen-error. Our findings offer the new insight that the generalization performance of SSL with pseudo-labeling is affected not only by the information between the output hypothesis and the input training data but also by the information {\em shared} between the {\em labeled} and {\em pseudo-labeled} data samples. To deepen our understanding, we further explore two examples -- mean estimation and logistic regression. In particular, we analyze how the ratio of the number of unlabeled to labeled data samples, $\lambda$, affects the gen-error in both scenarios. As $\lambda$ increases, the gen-error for mean estimation decreases and then saturates at a value larger than when all the samples are labeled; this gap can be quantified {\em exactly} with our analysis and depends on the \emph{cross-covariance} between the labeled and pseudo-labeled data samples. In logistic regression, the gen-error and the variance component of the excess risk also decrease as $\lambda$ increases.
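As a minimal background sketch of the quantities referenced above (standard definitions, not the paper's exact SSL result; the symbols $W$, $S$, $\pi$, $\beta$, and $L_E$ are illustrative and may differ from the notation used in the body of the paper): the Gibbs algorithm with inverse temperature $\beta$, prior $\pi$, and empirical risk $L_E$ draws the output hypothesis $W$ from the Gibbs posterior, and the symmetrized KL information between $W$ and the training data $S$ is the sum of the two directed KL divergences between the joint distribution and the product of its marginals,
\[
P^{\beta}_{W \mid S}(w \mid s) = \frac{\pi(w)\, e^{-\beta L_E(w, s)}}{\int \pi(w')\, e^{-\beta L_E(w', s)}\, \mathrm{d}w'},
\qquad
I_{\mathrm{SKL}}(W; S) = D\!\left(P_{W,S} \,\middle\|\, P_W \otimes P_S\right) + D\!\left(P_W \otimes P_S \,\middle\|\, P_{W,S}\right),
\]
i.e., the sum of the mutual information and the lautum information between $W$ and $S$. The characterization in this paper extends this type of identity to the semi-supervised setting, where the pseudo-labeled dataset enters the characterization alongside the labeled one.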