The emergence of massive data in recent years brings challenges to automatic statistical inference. This is particularly true when the data are too large to be read into memory as a whole. Accordingly, new sampling techniques are needed to sample data from a hard drive. In this paper, we propose a sequential addressing subsampling (SAS) method that can sample data directly from the hard drive. The proposed SAS method incurs a lower addressing cost, and therefore takes less time, than the random addressing subsampling (RAS) method. Estimators (e.g., the sample mean) based on the SAS subsamples are constructed, and their properties are studied. We conduct a series of simulation studies to verify the finite-sample performance of the proposed SAS estimators. The time costs of the SAS and RAS methods are also compared. An analysis of the airline data is presented for illustration.