In the era of Big data, computationally efficient and scalable methods are needed to support timely insights and informed decision making. One such method is sub-sampling, where a subset of the Big data is analysed and used as the basis for inference in place of the whole data set. A key question when applying sub-sampling is how to select a subset that is informative for the questions being asked of the data. A recently proposed approach assigns a sub-sampling probability to each data point, but its limitation is that appropriate sub-sampling probabilities depend on an assumed model for the Big data. To overcome this limitation, we propose a model-robust approach in which a set of candidate models is considered, and the sub-sampling probabilities are the weighted average of the probabilities that would be obtained if each model were considered individually. Theoretical support for this approach is provided. Our model-robust sub-sampling approach is applied in a simulation study and in two real-world applications, where its performance is compared to current sub-sampling practices. The results show that our model-robust approach outperforms the alternative approaches.
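The weighted-average step described above can be illustrated with a minimal sketch. It assumes the per-model sub-sampling probabilities have already been computed by some model-specific rule, which the abstract does not specify. The function names `model_robust_probabilities` and `draw_subsample`, the example probabilities, and the equal model weights are all hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np


def model_robust_probabilities(per_model_probs, model_weights):
    """Average per-model sub-sampling probabilities using model weights.

    per_model_probs: shape (n_models, n_points); row q holds the probabilities
        that would be used if model q alone were assumed (hypothetical input).
    model_weights: shape (n_models,); non-negative weights on the models.
    """
    probs = np.asarray(per_model_probs, dtype=float)
    weights = np.asarray(model_weights, dtype=float)
    if probs.ndim != 2 or weights.shape != (probs.shape[0],):
        raise ValueError("expected one weight per row of per_model_probs")
    if np.any(probs < 0) or np.any(weights < 0):
        raise ValueError("probabilities and weights must be non-negative")
    # Normalise so the weights and each model's probabilities sum to one.
    weights = weights / weights.sum()
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Weighted average over models, giving one probability per data point.
    return weights @ probs


def draw_subsample(probs, size, seed=0):
    """Draw row indices with replacement according to the given probabilities."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(probs), size=size, replace=True, p=probs)


# Illustrative values only: two candidate models, three data points.
example_probs = [[0.5, 0.3, 0.2],
                 [0.1, 0.1, 0.8]]
example_weights = [0.5, 0.5]  # assumed equal weights on the two models
averaged = model_robust_probabilities(example_probs, example_weights)
indices = draw_subsample(averaged, size=2)
```

Because each row of per-model probabilities sums to one and the weights sum to one, the averaged vector is itself a valid probability distribution over the data points, so it can be passed directly to a sampling routine.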