Debiased machine learning for combining probability and non-probability survey data

We consider the problem of estimating the finite population mean $\bar{Y}$ of an outcome variable $Y$ using data from a nonprobability sample and auxiliary information from a probability sample. Existing double robust (DR) estimators of this mean $\bar{Y}$ require the estimation of two nuisance functions: the conditional probability of selection into the nonprobability sample given covariates $X$ that are observed in both samples, and the conditional expectation of $Y$ given $X$. These nuisance functions can be estimated using parametric models, but the resulting estimator of $\bar{Y}$ will typically be biased if both parametric models are misspecified. It would therefore be advantageous to be able to use more flexible data-adaptive / machine-learning estimators of the nuisance functions. Here, we develop a general framework for the valid use of DR estimators of $\bar{Y}$ when the design of the probability sample uses sampling without replacement at the first stage and data-adaptive / machine-learning estimators are used for the nuisance functions. We prove that several DR estimators of $\bar{Y}$, including targeted maximum likelihood estimators, are asymptotically normally distributed when the estimators of the nuisance functions converge faster than the $n^{1/4}$ rate and cross-fitting is used. We present a simulation study that demonstrates good performance of these DR estimators compared to the corresponding DR estimators that rely on at least one correctly specified parametric model.
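To make the structure of such an estimator concrete, the following is a minimal sketch of one commonly used DR form, not necessarily the exact estimators studied here. It assumes the estimator adds an inverse-probability-weighted residual correction from the nonprobability sample to a design-weighted prediction total from the probability sample. The function name `dr_mean`, its argument names, and the default choice of estimating the population size by the sum of the design weights are all illustrative assumptions. With cross-fitting, the fitted values passed in for each unit would come from nuisance models trained on folds that exclude that unit.

```python
import numpy as np


def dr_mean(y_np, m_np, pi_np, m_prob, d_prob, N=None):
    """Doubly robust estimate of a finite population mean (illustrative sketch).

    y_np   : outcomes observed in the nonprobability sample
    m_np   : fitted E[Y | X] for the nonprobability-sample units
    pi_np  : fitted selection probabilities for the nonprobability-sample units
    m_prob : fitted E[Y | X] for the probability-sample units
    d_prob : design weights of the probability-sample units
    N      : population size; if None, estimated by sum(d_prob) (an assumption)
    """
    y_np = np.asarray(y_np, dtype=float)
    m_np = np.asarray(m_np, dtype=float)
    pi_np = np.asarray(pi_np, dtype=float)
    m_prob = np.asarray(m_prob, dtype=float)
    d_prob = np.asarray(d_prob, dtype=float)

    # Selection probabilities must lie in (0, 1] for the weights to be defined.
    if np.any(pi_np <= 0) or np.any(pi_np > 1):
        raise ValueError("selection probabilities must lie in (0, 1]")

    if N is None:
        N = d_prob.sum()

    # Inverse-probability-weighted residuals from the nonprobability sample.
    correction = np.sum((y_np - m_np) / pi_np)
    # Design-weighted total of outcome-model predictions from the probability sample.
    prediction = np.sum(d_prob * m_prob)
    return (correction + prediction) / N
```

In this form, the residual term vanishes when the outcome model reproduces the observed outcomes exactly, and the estimator reduces to a pure inverse-probability-weighted mean when the outcome model predicts zero everywhere. This illustrates how each nuisance function can compensate for errors in the other.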
