Efficient Estimation and Evaluation of Prediction Rules in Semi-Supervised Settings under Stratified Sampling

In many contemporary applications, large amounts of unlabeled data are readily available while labeled examples are limited. There has been substantial interest in semi-supervised learning (SSL), which aims to leverage unlabeled data to improve estimation or prediction. However, the current SSL literature focuses primarily on settings where labeled data are selected randomly from the population of interest. Non-random sampling, while posing additional analytical challenges, is highly applicable to many real-world problems. Moreover, no SSL methods currently exist for estimating the prediction performance of a fitted model under non-random sampling. In this paper, we propose a two-step SSL procedure that, under stratified sampling, evaluates a prediction rule derived from a working binary regression model using the Brier score and the overall misclassification rate. In step I, we impute the missing labels via weighted regression with nonlinear basis functions to account for non-random sampling and to improve efficiency. In step II, we augment the initial imputations to ensure the consistency of the resulting estimators regardless of the specification of the prediction model or the imputation model. The final estimator is then obtained with the augmented imputations. We provide asymptotic theory and numerical studies illustrating that our proposals are more efficient than their supervised counterparts. Our methods are motivated by electronic health records (EHR) research and validated with a real data analysis of an EHR-based study of diabetic neuropathy.
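To make the two-step idea concrete, the following is a minimal Python sketch of a semi-supervised Brier score estimate under stratified sampling. It is an illustration under stated assumptions, not the procedure developed in the paper: the prediction rule is treated as a fixed vector of predicted probabilities `p_all`, step I uses weighted least squares on a polynomial basis in place of the weighted regression described above, and step II is replaced by a simple inverse-probability-weighted residual correction. All function names, variable names, and the simulated data are hypothetical.

```python
import numpy as np


def supervised_brier(p_lab, y_lab, w_lab):
    """Weighted Brier score using labeled data only (the supervised benchmark)."""
    return np.sum(w_lab * (y_lab - p_lab) ** 2) / np.sum(w_lab)


def ss_brier(p_all, basis_all, labeled_idx, y_lab, w_lab):
    """Illustrative two-step semi-supervised Brier score estimate.

    p_all       : predicted probabilities for every observation
    basis_all   : basis-function matrix for every observation
    labeled_idx : indices of the labeled observations
    y_lab       : binary labels of the labeled observations
    w_lab       : sampling weights (inverse stratum sampling fractions)
    """
    # Step I (assumed form): impute E[Y | x] by weighted least squares
    # on the basis functions, fitted on the labeled subset.
    sw = np.sqrt(w_lab)
    coef, *_ = np.linalg.lstsq(
        basis_all[labeled_idx] * sw[:, None], y_lab * sw, rcond=None
    )
    m_all = np.clip(basis_all @ coef, 0.0, 1.0)

    # For binary Y, E[(Y - p)^2 | x] = m(x) * (1 - 2p) + p^2.
    imputed_loss = m_all * (1.0 - 2.0 * p_all) + p_all ** 2

    # Step II (simplified stand-in for the augmentation): add the weighted
    # mean residual so the estimate does not rely on a correct imputation model.
    observed_loss = (y_lab - p_all[labeled_idx]) ** 2
    correction = np.sum(
        w_lab * (observed_loss - imputed_loss[labeled_idx])
    ) / np.sum(w_lab)
    return imputed_loss.mean() + correction


# Simulated data (hypothetical): two strata sampled at different rates.
rng = np.random.default_rng(0)
N = 5000
x = rng.normal(size=N)
stratum = (x > 0).astype(int)
y_full = rng.binomial(1, 1.0 / (1.0 + np.exp(-(x + 0.5 * x ** 2 - 0.5))))
p_all = 1.0 / (1.0 + np.exp(-x))  # hypothetical, misspecified prediction rule

n_per_stratum = {0: 100, 1: 300}
labeled_idx = np.concatenate([
    rng.choice(np.flatnonzero(stratum == s), size=n, replace=False)
    for s, n in n_per_stratum.items()
])
stratum_sizes = np.bincount(stratum)
w_lab = np.array([
    stratum_sizes[int(s)] / n_per_stratum[int(s)] for s in stratum[labeled_idx]
])
y_lab = y_full[labeled_idx].astype(float)
basis_all = np.column_stack([np.ones(N), x, x ** 2, stratum])

print("supervised:     ", supervised_brier(p_all[labeled_idx], y_lab, w_lab))
print("semi-supervised:", ss_brier(p_all, basis_all, labeled_idx, y_lab, w_lab))
```

The residual correction is what keeps this sketch from depending on the imputation model: when every observation is labeled with equal weight, the estimate reduces exactly to the ordinary mean squared error. The overall misclassification rate could be handled in the same way by replacing the squared-error loss with the 0/1 loss at a chosen threshold.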
