In many settings, such as electronic health records, the outcome is much more difficult to collect than the covariates. In this paper, we consider the linear regression problem with such a data structure in the high-dimensional setting. Our goal is to investigate when and how the unlabeled data can be exploited to improve estimation and inference for the regression parameters in linear models, especially given that such linear models may be misspecified in data analysis. In particular, we address the following two important questions. (1) Can we use the labeled data together with the unlabeled data to construct a semi-supervised estimator whose convergence rate is faster than that of the supervised estimators? (2) Can we construct confidence intervals or hypothesis tests that are guaranteed to be more efficient or more powerful than their supervised counterparts? To address the first question, we establish the minimax lower bound for parameter estimation in the semi-supervised setting. We show that the upper bound for the supervised estimators, which use only the labeled data, cannot attain this lower bound. We close this gap by proposing a new semi-supervised estimator that attains the lower bound. To address the second question, we build on the proposed semi-supervised estimator to construct two additional estimators for semi-supervised inference: the efficient estimator and the safe estimator. The former is fully efficient if the unknown conditional mean function is estimated consistently, but may not be more efficient than the supervised approach otherwise. The latter does not generally aim to provide fully efficient inference, but is guaranteed to be no worse than the supervised approach, regardless of whether the linear model is correctly specified or the conditional mean function is consistently estimated.