This paper considers statistical inference for the explained variance $\beta^{\intercal}\Sigma \beta$ under the high-dimensional linear model $Y=X\beta+\epsilon$ in the semi-supervised setting, where $\beta$ is the regression vector and $\Sigma$ is the design covariance matrix. A calibrated estimator, which efficiently integrates both labelled and unlabelled data, is proposed. It is shown that the estimator achieves the minimax optimal rate of convergence in the general semi-supervised framework. The optimality result characterizes how the unlabelled data contributes to the estimation accuracy. Moreover, the limiting distribution of the proposed estimator is established, and the unlabelled data is shown to shorten the confidence interval for the explained variance. The proposed method is extended to semi-supervised inference for the unweighted quadratic functional $\|\beta\|_2^2$. The inference results are then applied to a range of high-dimensional statistical problems, including signal detection and global testing, prediction accuracy evaluation, and confidence ball construction. The numerical improvement from incorporating the unlabelled data is demonstrated through simulation studies and a heritability analysis of a yeast segregant data set with multiple traits.