We study a high-dimensional linear regression model in a semi-supervised setting, in which many observations consist only of the covariate vector $X$, with no response $Y$. We make no sparsity assumptions on the coefficient vector and aim to estimate $\mathrm{Var}(Y|X)$. We propose an estimator that is unbiased, consistent, and asymptotically normal. This estimator can be improved by adding zero-estimators, that is, unbiased estimators of zero, constructed from the unlabelled data. Adding zero-estimators does not change the bias and can potentially reduce the variance. Achieving the optimal improvement requires many zero-estimators, which raises the problem of estimating many parameters. We therefore introduce covariate selection algorithms that identify which zero-estimators should be used to improve the proposed estimator. We further illustrate our approach with other estimators and present an algorithm that improves any given variance estimator. A simulation study demonstrates our theoretical results.
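The zero-estimator idea can be illustrated with a toy simulation. The sketch below is not the estimator proposed in the paper; it only shows why subtracting mean-zero statistics leaves the bias unchanged while reducing the variance. All names are hypothetical, and the setup rests on three assumptions: the covariates are standard normal with a known distribution (standing in for abundant unlabelled data), the target is $E[Y^2]$ rather than $\mathrm{Var}(Y|X)$, and the coefficients of the zero-estimators are oracle values $\beta_j^2$, which would be unknown in practice.

```python
import numpy as np


def simulate_zero_estimator(n=50, reps=4000, seed=0):
    """Toy Monte Carlo comparison of a naive estimator and a zero-estimator-adjusted one.

    Assumptions (not from the paper): X ~ N(0, I_p) is known, the target is E[Y^2],
    and the oracle coefficients beta_j**2 are used for the adjustment.
    """
    rng = np.random.default_rng(seed)
    beta = np.array([1.5, 1.0, 0.5, 0.0, 0.0])  # hypothetical coefficient vector
    sigma = 1.0                                  # hypothetical noise level
    p = beta.size

    naive = np.empty(reps)
    improved = np.empty(reps)
    for r in range(reps):
        X = rng.standard_normal((n, p))
        y = X @ beta + sigma * rng.standard_normal(n)

        t = np.mean(y ** 2)              # naive unbiased estimator of E[Y^2]
        z = np.mean(X ** 2 - 1.0, axis=0)  # p zero-estimators: E[X_j^2 - 1] = 0

        # For Gaussian X, Cov(Y^2, X_j^2) = 2 * beta_j^2 and Var(X_j^2) = 2,
        # so the variance-minimising coefficient of the j-th zero-estimator is beta_j^2.
        c = beta ** 2                    # oracle coefficients (unknown in practice)

        naive[r] = t
        improved[r] = t - c @ z          # same expectation as t, smaller variance

    target = beta @ beta + sigma ** 2    # true value of E[Y^2]
    return target, naive, improved


if __name__ == "__main__":
    target, naive, improved = simulate_zero_estimator()
    print("target:", target)
    print("naive    mean, var:", naive.mean(), naive.var())
    print("improved mean, var:", improved.mean(), improved.var())
```

In this toy setting the optimal coefficients depend on the unknown $\beta_j$, one per covariate, which mirrors the difficulty noted in the abstract: using many zero-estimators requires estimating many parameters, and this motivates selecting a subset of covariates.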