Administrative data, or non-probability sample data, are increasingly used to produce official statistics because of their many benefits over survey methods. In particular, they are less costly, provide larger sample sizes, and do not depend on response rates. However, it is difficult to obtain an unbiased estimate of the population mean from such data because design weights are absent. Several estimation approaches have recently been proposed that use an auxiliary probability sample providing representative covariate information for the target population. However, when this covariate information is high-dimensional, variable selection is not a straightforward task, even for a subject-matter expert. In the context of efficient and doubly robust approaches for estimating a population mean, we develop two data-adaptive methods for variable selection, using the outcome-adaptive LASSO and a collaborative propensity score, respectively. Simulation studies are performed to assess the performance of the proposed methods against competing methods. Finally, we present an analysis of the impact of Covid-19 on Canadians.