We propose a novel CDF estimator that integrates data from probability samples with data from potentially big nonprobability samples. Assuming that a set of shared covariates is observed in both samples, while the response variable is observed only in the nonprobability sample, the proposed estimator uses a survey-weighted empirical CDF of residuals from a regression model fitted to the convenience sample to estimate the CDF of the response variable. Under some assumptions, we derive the asymptotic bias and variance of our CDF estimator and show that it is asymptotically unbiased for the finite population CDF if ignorability holds. Empirical results demonstrate that the estimator performs well under model misspecification when ignorability holds, and under nonignorable sampling when the outcome model is correctly specified. Even when both assumptions fail, the residual-based estimator continues to outperform its plug-in and naive counterparts, albeit with noted decreases in efficiency.
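To make the construction concrete, the following is a minimal sketch of one plausible reading of a residual-based CDF estimator, not necessarily the exact estimator studied here. It assumes a linear outcome model fitted by ordinary least squares and a Hájek-type normalization by the sum of the survey weights; the function name `residual_cdf_estimate` and its arguments are hypothetical. The regression is fitted on the nonprobability sample, the empirical CDF of its residuals is evaluated at `t` minus each fitted value in the probability sample, and these values are averaged with the survey weights.

```python
import numpy as np


def residual_cdf_estimate(t, x_np, y_np, x_p, w_p):
    """Sketch of a residual-based estimate of F(t) = P(Y <= t).

    x_np, y_np : covariates and responses from the nonprobability sample
    x_p, w_p   : covariates and survey weights from the probability sample
    All names are illustrative; the linear model is an assumption.
    """
    x_np = np.asarray(x_np, dtype=float)
    y_np = np.asarray(y_np, dtype=float)
    x_p = np.asarray(x_p, dtype=float)
    w_p = np.asarray(w_p, dtype=float)

    # Fit the outcome model on the nonprobability sample (OLS with intercept).
    design_np = np.column_stack([np.ones(len(x_np)), x_np])
    beta, *_ = np.linalg.lstsq(design_np, y_np, rcond=None)

    # Sorted residuals define the empirical residual CDF G(u).
    resid = np.sort(y_np - design_np @ beta)

    # Fitted values for the probability sample, where y is not observed.
    design_p = np.column_stack([np.ones(len(x_p)), x_p])
    fitted_p = design_p @ beta

    # G(t - fitted_i): share of residuals less than or equal to t - fitted_i.
    g = np.searchsorted(resid, t - fitted_p, side="right") / len(resid)

    # Survey-weighted average, normalized by the sum of the weights.
    return float(np.sum(w_p * g) / np.sum(w_p))
```

Normalizing by the sum of the weights keeps the estimate in [0, 1] even when the weights do not sum exactly to the population size; dividing by a known population size would be an alternative design choice.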