Subsampling is an important technique for tackling the computational challenges posed by big data. Many subsampling procedures fall within the framework of importance sampling, which assigns high sampling probabilities to samples that appear to have large impacts. When the noise level is high, such procedures tend to pick many outliers and thus often do not perform satisfactorily in practice. To address this issue, we design a new Markov subsampling strategy based on the Huber criterion (HMS) to construct an informative subset from the noisy full data; the constructed subset then serves as a refined working dataset for efficient processing. HMS is built upon a Metropolis-Hastings procedure, where the inclusion probability of each sampling unit is determined using the Huber criterion to prevent over-scoring outliers. Under mild conditions, we show that the estimator based on the subsamples selected by HMS is statistically consistent with a sub-Gaussian deviation bound. The promising performance of HMS is demonstrated by extensive studies on large-scale simulations and real data examples.
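To make the mechanism described above concrete, the following Python sketch illustrates one way a Metropolis-Hastings-style subsampler could be scored by the Huber criterion. It is only an illustration under stated assumptions, not the authors' algorithm: the function names `hms_subsample` and `huber_loss`, the pilot estimate `beta_init`, the threshold `delta`, and the acceptance rule are all hypothetical choices for exposition.

```python
import numpy as np

def huber_loss(r, delta=1.345):
    """Huber criterion: quadratic for small residuals, linear for large ones,
    so outliers receive bounded influence."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def hms_subsample(X, y, m, beta_init, delta=1.345, seed=None):
    """Hypothetical sketch of a Metropolis-Hastings subsampler scored by the
    Huber criterion (not the paper's exact HMS procedure).

    A candidate unit replaces the current one with probability
    min(1, exp(score_current - score_candidate)), so units with large Huber
    loss (likely outliers) are rarely retained in the chain."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    residuals = y - X @ beta_init           # residuals under a pilot estimate (assumed available)
    scores = huber_loss(residuals, delta)   # robust score for each sampling unit
    selected = []
    current = rng.integers(n)
    while len(selected) < m:
        candidate = rng.integers(n)
        accept_prob = min(1.0, np.exp(scores[current] - scores[candidate]))
        if rng.random() < accept_prob:      # Metropolis-Hastings acceptance step
            current = candidate
        selected.append(current)
    return np.unique(selected)              # indices of the retained subsample
```

The retained indices would then define the refined working dataset on which a downstream estimator is fit.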