We consider inference from non-random samples in data-rich settings where high-dimensional auxiliary information is available for both the sample and the target population, with survey inference as a special case. We propose a regularized prediction approach that predicts the outcomes in the population using a large number of auxiliary variables, which makes the ignorability assumption reasonable, while the Bayesian framework makes uncertainty quantification straightforward. Inspired by Little & An (2004), we also extend the approach by estimating the propensity score for a unit to be included in the sample and including it, alongside the auxiliary variables, as a predictor in the machine learning models. We show through simulation studies that the regularized predictions using soft Bayesian additive regression trees yield valid inference for population means, with coverage rates close to the nominal levels. We demonstrate the proposed methods in two real data applications, one in a survey and one in an epidemiology study.
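As a rough sketch of the prediction approach described above, the population mean can be written in terms of observed and predicted outcomes. The notation here is an assumption made for illustration and does not come from the abstract: $N$ is the population size, $S$ the set of sampled units, $x_i$ the auxiliary variables, $I_i$ the sample inclusion indicator, $\hat{\pi}_i$ the estimated propensity score, $f$ the regularized outcome model (for example, soft Bayesian additive regression trees), and $d = 1, \dots, D$ indexes posterior draws.

```latex
% Assumed notation, for illustration only.
% Ignorability: given the auxiliary variables, inclusion is unrelated to the outcome.
\[
  \Pr(I_i = 1 \mid y_i, x_i) = \Pr(I_i = 1 \mid x_i) = \pi(x_i)
\]
% Outcome model fitted on the sample, with the estimated propensity as an extra predictor.
\[
  \mathrm{E}(y_i \mid x_i) = f\bigl(x_i, \hat{\pi}_i\bigr), \qquad \hat{\pi}_i = \hat{\pi}(x_i)
\]
% Posterior draw d of the population mean: observed outcomes for sampled units,
% posterior predictive draws \tilde{y}_i^{(d)} for non-sampled units.
\[
  \bar{Y}^{(d)} = \frac{1}{N} \Bigl( \sum_{i \in S} y_i + \sum_{i \notin S} \tilde{y}_i^{(d)} \Bigr),
  \qquad d = 1, \dots, D
\]
% Point estimate as the average over draws.
\[
  \hat{\bar{Y}} = \frac{1}{D} \sum_{d=1}^{D} \bar{Y}^{(d)}
\]
```

Under these assumptions, a credible interval for the population mean would follow from the empirical quantiles of the draws $\bar{Y}^{(1)}, \dots, \bar{Y}^{(D)}$, which is one way the Bayesian framework can deliver uncertainty quantification directly.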