In the era of data science, it is common to encounter data in which different subsets of variables are observed for different cases. An example is the split questionnaire design (SQD), which is adopted to reduce respondent fatigue and improve response rates by assigning different subsets of the questionnaire to different sampled respondents. A general question then is how to estimate the regression function from such block-wise observed data. Currently, this is often carried out with the aid of missing data methods, which may unfortunately suffer from intensive computational cost, high variability, and possibly large modeling biases in real applications. In this article, we develop a novel approach for estimating the regression function for SQD-type data. We first construct a list of candidate models, each fitted separately on one of the available data blocks, and then combine the resulting estimates properly to make efficient use of all the information. We show that the resulting averaged model is asymptotically optimal in the sense that its squared loss and risk are asymptotically equivalent to those of the best but infeasible averaged estimator. Both simulated examples and an application to the SQD dataset from the European Social Survey show the promise of the proposed method.
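The two-step idea in the abstract (fit one candidate model per data block, then average the fits) can be illustrated with a minimal sketch. This is not the authors' estimator or weight-selection criterion, which the abstract does not specify. It assumes, purely for illustration, that each block observes a single covariate together with the response, that candidate models are simple linear regressions, and that a small validation sample with all covariates observed is available for choosing weights on the simplex. The helper names `fit_line`, `sq_loss`, and `simplex_weights` are hypothetical, and the data are simulated.

```python
import random

def fit_line(x, y):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope

def sq_loss(w, preds, y):
    """Squared loss of the weighted average of the candidate predictions."""
    return sum((yi - sum(wk * p[i] for wk, p in zip(w, preds))) ** 2
               for i, yi in enumerate(y))

def simplex_weights(preds, y, iters=500):
    """Frank-Wolfe with exact line search: minimise the squared loss over
    weights that are non-negative and sum to one."""
    K, n = len(preds), len(y)
    # Start at the best single candidate, so the loss can only improve on it.
    best = min(range(K),
               key=lambda k: sq_loss([float(j == k) for j in range(K)], preds, y))
    w = [float(j == best) for j in range(K)]
    for _ in range(iters):
        fitted = [sum(w[k] * preds[k][i] for k in range(K)) for i in range(n)]
        resid = [y[i] - fitted[i] for i in range(n)]
        grad = [-2.0 * sum(resid[i] * preds[k][i] for i in range(n))
                for k in range(K)]
        j = min(range(K), key=grad.__getitem__)        # steepest-descent vertex
        d = [preds[j][i] - fitted[i] for i in range(n)]  # move towards vertex j
        denom = sum(di * di for di in d)
        if denom == 0.0:
            break
        step = sum(r * di for r, di in zip(resid, d)) / denom
        step = max(0.0, min(1.0, step))                # stay inside the simplex
        w = [(1.0 - step) * wk for wk in w]
        w[j] += step
    return w

rng = random.Random(0)

def draw(n):
    """Simulated rows (x1, x2, y); each block below uses only part of a row."""
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((x1, x2, 1.0 + 2.0 * x1 - 1.0 * x2 + rng.gauss(0, 0.5)))
    return rows

block_a = draw(200)   # assumed block A: only x1 and y are observed
block_b = draw(200)   # assumed block B: only x2 and y are observed
valid = draw(100)     # assumed validation sample with x1, x2 and y observed

# Step 1: one candidate model per block, fitted on that block alone.
models = [
    fit_line([r[0] for r in block_a], [r[2] for r in block_a]),
    fit_line([r[1] for r in block_b], [r[2] for r in block_b]),
]

# Step 2: combine the candidate predictions with simplex weights.
y_val = [r[2] for r in valid]
preds = [[a + b * r[k] for r in valid] for k, (a, b) in enumerate(models)]
weights = simplex_weights(preds, y_val)
single_losses = [sq_loss([float(j == k) for j in range(len(preds))], preds, y_val)
                 for k in range(len(preds))]
avg_loss = sq_loss(weights, preds, y_val)
print("weights:", weights)
print("single-model losses:", single_losses, "averaged loss:", avg_loss)
```

Because the search starts from the best single candidate and each step is an exact line search, the averaged fit in this sketch is never worse than any single block's model on the validation sample. That is a property of this toy construction only, not a restatement of the asymptotic optimality result claimed in the abstract.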