Semi-Supervised Learning (SSL) is fundamentally a missing-label problem, in which the Missing Not At Random (MNAR) setting is more realistic and challenging than the widely adopted yet naive Missing Completely At Random (MCAR) assumption, where labeled and unlabeled data share the same class distribution. Unlike existing SSL solutions that overlook the role of "class" in causing the non-randomness (e.g., users are more likely to label popular classes), we explicitly incorporate "class" into SSL. Our method is three-fold: 1) We propose Class-Aware Propensity (CAP), which exploits the unlabeled data to train an improved classifier from the biased labeled data. 2) To encourage the training of rare classes, whose classifiers are low-recall but high-precision and thus discard too many pseudo-labeled samples, we propose Class-Aware Imputation (CAI), which dynamically decreases (or increases) the pseudo-label assignment threshold for rare (or frequent) classes. 3) We integrate CAP and CAI into a Class-Aware Doubly Robust (CADR) estimator for training an unbiased SSL model. Under various MNAR settings and ablations, our method not only significantly outperforms existing baselines but also surpasses other label-bias-removal SSL methods. Our code is available at: https://github.com/JoyHuYY1412/CADR-FixMatch.
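The class-aware thresholding idea behind CAI can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact rule: the function names (`class_aware_thresholds`, `select_pseudo_labels`) and the linear interpolation between a minimum and a base threshold are hypothetical choices made here for clarity; the actual CAI update in the paper may differ.

```python
import numpy as np

def class_aware_thresholds(pseudo_label_counts, base_tau=0.95, min_tau=0.5):
    """Illustrative CAI-style thresholds (hypothetical formula).

    Classes that currently receive few pseudo-labels (rare classes) get a
    lower confidence threshold so more of their candidates are kept;
    frequent classes keep a threshold near base_tau.
    """
    counts = np.asarray(pseudo_label_counts, dtype=float)
    # Normalize by the most frequent class: the rarest class moves toward
    # min_tau, the most frequent class maps to base_tau.
    ratio = counts / counts.max()
    return min_tau + (base_tau - min_tau) * ratio

def select_pseudo_labels(probs, thresholds):
    """Keep an unlabeled sample only if its max predicted probability
    exceeds the threshold of its predicted class."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    mask = conf >= thresholds[preds]
    return preds[mask], mask
```

For example, with pseudo-label counts `[100, 10]`, the rare class 1 gets a threshold of 0.545 instead of 0.95, so a sample predicted as class 1 with confidence 0.6 is kept rather than discarded.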