Big Data often takes the form of massive non-probability samples. Not only is the selection mechanism often unknown, but larger data volume also amplifies the relative contribution of selection bias to total error. Existing bias adjustment approaches assume that the conditional mean structures have been correctly specified for the selection indicator or the key substantive measures. In the presence of a reference probability sample, these methods rely on a pseudo-likelihood approach, itself parametric in nature, to account for the sampling weights of the reference sample. Under a Bayesian framework, handling the sampling weights is an even bigger hurdle. To further protect against model misspecification, we expand the idea of double robustness so that more flexible non-parametric methods, as well as Bayesian models, can be used for prediction. In particular, we employ Bayesian additive regression trees, which not only capture non-linear associations automatically but also permit direct quantification of the uncertainty of point estimates through their posterior predictive draws. We apply our method to sensor-based naturalistic driving data from the second Strategic Highway Research Program, using the 2017 National Household Travel Survey as a benchmark.
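To make the idea of double robustness concrete, the following is a minimal sketch of one common augmented inverse-propensity-weighted form of a doubly robust mean estimator. It is not the authors' implementation. The function name `doubly_robust_mean` and all argument names are hypothetical, and the predictions and propensities are assumed to be supplied by whatever learner is preferred (for instance, posterior predictive draws from a BART model, one draw at a time).

```python
import numpy as np


def doubly_robust_mean(y_np, m_np, pi_np, m_ref, w_ref):
    """Doubly robust (AIPW-style) estimate of a population mean.

    All names are illustrative assumptions, not taken from the paper.
    y_np  : outcomes observed in the non-probability sample
    m_np  : outcome-model predictions for the same units
    pi_np : estimated propensities of inclusion in the non-probability sample
    m_ref : outcome-model predictions for the reference probability sample
    w_ref : design (sampling) weights of the reference sample
    """
    y_np = np.asarray(y_np, dtype=float)
    m_np = np.asarray(m_np, dtype=float)
    pi_np = np.asarray(pi_np, dtype=float)
    m_ref = np.asarray(m_ref, dtype=float)
    w_ref = np.asarray(w_ref, dtype=float)

    if np.any(pi_np <= 0) or np.any(pi_np > 1):
        raise ValueError("propensities must lie in (0, 1]")

    # Prediction term: design-weighted mean of the outcome-model
    # predictions over the reference sample.
    prediction_term = np.sum(w_ref * m_ref) / np.sum(w_ref)

    # Correction term: inverse-propensity-weighted mean of the residuals
    # in the non-probability sample (normalized by the sum of weights).
    ipw = 1.0 / pi_np
    correction_term = np.sum(ipw * (y_np - m_np)) / np.sum(ipw)

    return prediction_term + correction_term


# Limiting case 1: a perfect outcome model leaves zero residuals, so the
# estimate reduces to the weighted mean of predictions in the reference sample.
perfect_outcome = doubly_robust_mean(
    y_np=[1.0, 2.0], m_np=[1.0, 2.0], pi_np=[0.5, 0.25],
    m_ref=[1.0, 3.0], w_ref=[1.0, 3.0],
)

# Limiting case 2: an uninformative outcome model (all predictions zero)
# reduces the estimate to the inverse-propensity-weighted mean of y.
ipw_only = doubly_robust_mean(
    y_np=[1.0, 2.0], m_np=[0.0, 0.0], pi_np=[0.5, 0.25],
    m_ref=[0.0, 0.0], w_ref=[1.0, 3.0],
)

print(perfect_outcome)  # 2.5
print(ipw_only)         # 10 / 6, about 1.667
```

The two limiting cases show why the estimator is called doubly robust: the prediction term carries the estimate when the outcome model is right, and the weighted residual term repairs it when only the propensity model is right. Repeating the computation over posterior draws of the predictions would give a rough picture of the uncertainty, under the assumptions stated above.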