Semi-supervised (SS) inference has received much attention in recent years. Apart from a moderate-sized labeled dataset, L, the SS setting is characterized by an additional, much larger unlabeled dataset, U. The setting |U| >> |L| makes SS inference unique and different from standard missing data problems, owing to the natural violation of the so-called 'positivity' or 'overlap' assumption. However, most of the SS literature implicitly assumes that L and U are identically distributed, i.e., that there is no selection bias in the labeling. The inferential challenges of missing-at-random (MAR) labeling, which allows for selection bias, are inevitably exacerbated by the decaying nature of the propensity score (PS). We address this gap for a prototype problem: estimating the mean of the response. We propose a double robust SS (DRSS) mean estimator and give a complete characterization of its asymptotic properties. The proposed estimator is consistent as long as either the outcome model or the PS model is correctly specified. When both models are correctly specified, we provide inference results with a non-standard consistency rate that depends on the smaller size |L|. The results are also extended to causal inference with imbalanced treatment groups. Further, we provide several novel choices of models and estimators for the decaying PS, including a novel offset logistic model and a stratified labeling model, and we present their properties in both high- and low-dimensional settings. These may be of independent interest. Finally, we present extensive simulations and a real data application.
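To make the double robustness property concrete, the following is a minimal sketch of the textbook augmented inverse-probability-weighted (AIPW) mean estimator under MAR labeling. It is not the DRSS estimator itself, whose exact construction is not given in this abstract. The function name `dr_mean`, the simulated data-generating process, and the deliberately misspecified nuisance models are all illustrative assumptions.

```python
import numpy as np

def dr_mean(y, labeled, m_hat, pi_hat):
    """Textbook AIPW estimate of E[Y] (illustrative, not the paper's DRSS).

    y       : outcomes; entries for unlabeled rows may be NaN
    labeled : 0/1 labeling indicator R
    m_hat   : outcome-model predictions m(X) for every row
    pi_hat  : propensity score estimates P(R = 1 | X) for every row
    """
    # Zero out unobserved outcomes so they drop out of the correction term.
    y_filled = np.where(labeled == 1, y, 0.0)
    return float(np.mean(m_hat + labeled * (y_filled - m_hat) / pi_hat))

# Hypothetical simulation: labeling is rare and depends on X (selection bias).
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y_full = 1.0 + 2.0 * x + rng.normal(size=n)      # true mean E[Y] = 1
pi_true = 1.0 / (1.0 + np.exp(3.0 - 0.5 * x))    # logistic PS, roughly 5% labeled
labeled = (rng.uniform(size=n) < pi_true).astype(float)
y = np.where(labeled == 1, y_full, np.nan)       # outcomes seen only when labeled

# Labeled-only average: ignores the selection bias.
naive = float(np.mean(y_full[labeled == 1]))

# Case 1: correct outcome model, misspecified (constant) PS.
m_true = 1.0 + 2.0 * x
pi_wrong = np.full(n, labeled.mean())
est_outcome_ok = dr_mean(y, labeled, m_true, pi_wrong)

# Case 2: misspecified (zero) outcome model, correct PS.
m_wrong = np.zeros(n)
est_ps_ok = dr_mean(y, labeled, m_wrong, pi_true)

print(f"naive labeled mean:      {naive:.3f}")
print(f"DR, outcome model right: {est_outcome_ok:.3f}")
print(f"DR, PS model right:      {est_ps_ok:.3f}")
```

In this sketch the nuisance functions are plugged in as known quantities to isolate the double robustness property; in practice both would be estimated from data, and the decaying-PS regime described above is precisely where the standard analysis of this estimator no longer applies directly.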