We consider the problem of estimating a low-dimensional parameter in high-dimensional linear regression. Constructing an approximately unbiased estimate of the parameter of interest is a crucial step towards performing statistical inference. Several authors suggest orthogonalizing both the variable of interest and the outcome with respect to the nuisance variables, and then regressing the residual outcome on the residual variable. This approach works when the covariance structure of the regressors is known exactly, or is structured enough to be estimated accurately from data (e.g., when the precision matrix is sufficiently sparse). Here we consider a regime in which the covariate model can only be estimated inaccurately, and hence existing debiasing approaches are not guaranteed to work. When errors in estimating the covariate model are correlated with errors in estimating the linear model parameter, the bias is only partially eliminated. We propose the Correlation Adjusted Debiased Lasso (CAD), which nearly eliminates this bias in some cases, including cases in which the estimation errors are neither negligible nor orthogonal. We consider a setting in which some unlabeled samples might be available to the statistician alongside labeled ones (semi-supervised learning), and our guarantees hold under the assumption of jointly Gaussian covariates. The new debiased estimator is guaranteed to cancel the bias in two cases: (1) when the total number of samples (labeled and unlabeled) is larger than the number of parameters, or (2) when the covariance of the nuisance (but not the effect of the nuisance on the variable of interest) is known. Neither of these cases is treated by state-of-the-art methods.
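The orthogonalize-then-regress idea described above can be illustrated in the simplest setting, where the number of nuisance variables is small enough for ordinary least squares. The sketch below is not the CAD estimator and not any author's implementation; it only shows the classical residual-on-residual regression that the debiasing literature builds on. The function name `residual_on_residual`, the simulated data, and the choice of plain least squares for the partialling-out step are all assumptions made for illustration.

```python
import numpy as np


def residual_on_residual(y, x, Z):
    """Hypothetical helper: partial the nuisance Z out of y and x, then regress.

    Assumes Z has fewer columns than rows so that least squares is well posed.
    """
    # Least-squares coefficients for projecting x and y onto the columns of Z.
    coef_x, *_ = np.linalg.lstsq(Z, x, rcond=None)
    coef_y, *_ = np.linalg.lstsq(Z, y, rcond=None)
    # Residuals are the parts of x and y orthogonal to the nuisance variables.
    rx = x - Z @ coef_x
    ry = y - Z @ coef_y
    # Slope of the regression of the residual outcome on the residual variable.
    return float(rx @ ry / (rx @ rx))


# Simulated low-dimensional example (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n, p = 500, 5
Z = rng.standard_normal((n, p))                        # nuisance variables
x = Z @ rng.standard_normal(p) + rng.standard_normal(n)  # variable of interest
theta = 2.0                                            # parameter of interest
y = theta * x + Z @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

theta_hat = residual_on_residual(y, x, Z)

# Coefficient on x from the full least-squares fit of y on [x, Z], for comparison.
full_coef, *_ = np.linalg.lstsq(np.column_stack([x, Z]), y, rcond=None)
```

In this low-dimensional case the residual-on-residual slope coincides with the coefficient on `x` from the full least-squares fit (the Frisch-Waugh-Lovell identity). The regime studied here differs precisely because the projection onto the nuisance variables cannot be estimated accurately, so this sketch should be read as background rather than as the proposed method.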