In the era of Big Data, training complex models on large-scale datasets is challenging, which makes it appealing to reduce data volume and save computational resources by subsampling. Most previous subsampling works are weighted methods designed to make the subset model approach the performance of the full-set model; hence these weighted methods have no chance of acquiring a subset model that is better than the full-set model. This raises the question: can we achieve a better model with less data? In this work, we propose a novel Unweighted Influence Data Subsampling (UIDS) method and prove that the subset model acquired through our method can outperform the full-set model. Moreover, we show that influence-based subsampling methods commonly place too much confidence in a given test set when sampling, which can eventually cause the subset model to fail on out-of-sample tests. To mitigate this, we develop a probabilistic sampling scheme that controls the worst-case risk over all distributions close to the empirical distribution. Experimental results demonstrate the superiority of our methods over existing subsampling methods in diverse tasks, such as text classification, image classification, and click-through prediction.
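To make the idea concrete, the following is a minimal, illustrative sketch of influence-based probabilistic subsampling for logistic regression, using the standard influence-function approximation of Koh and Liang (2017). It is not the paper's exact UIDS algorithm; the sampling rule, the damping constant, and names such as `influence_scores` and `sampling_probs` are assumptions made purely for illustration.

```python
# Illustrative sketch (not the authors' exact UIDS method): score training points
# by their influence on a held-out validation loss, then keep points
# probabilistically rather than reweighting them, and retrain on the subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the full-set model (no intercept so theta matches coef_ exactly).
clf = LogisticRegression(max_iter=1000, fit_intercept=False).fit(X_tr, y_tr)
theta = clf.coef_.ravel()

def grad_loss(X, y, theta):
    """Per-example gradient of the logistic loss w.r.t. theta."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return (p - y)[:, None] * X  # shape: (n, d)

# Hessian of the mean training loss, with small damping for invertibility.
p_tr = 1.0 / (1.0 + np.exp(-X_tr @ theta))
W = p_tr * (1.0 - p_tr)
H = (X_tr * W[:, None]).T @ X_tr / len(X_tr) + 1e-3 * np.eye(X_tr.shape[1])

# Influence of upweighting each training point on the validation loss:
#   I(z_i) = -grad L_val(theta)^T  H^{-1}  grad L(z_i, theta)
g_va = grad_loss(X_va, y_va, theta).mean(axis=0)
g_tr = grad_loss(X_tr, y_tr, theta)
influence_scores = -g_tr @ np.linalg.solve(H, g_va)

# Probabilistic (unweighted) sampling: points whose upweighting reduces the
# validation loss (negative influence) are kept with higher probability.
# This rank-based rule is only a stand-in for the paper's risk-controlled scheme.
rank = (-influence_scores).argsort().argsort() / (len(influence_scores) - 1)
sampling_probs = 0.2 + 0.6 * rank  # keep probabilities in [0.2, 0.8]
keep = np.random.default_rng(0).random(len(X_tr)) < sampling_probs

# Retrain on the sampled subset and compare against the full-set model.
subset_clf = LogisticRegression(max_iter=1000, fit_intercept=False).fit(X_tr[keep], y_tr[keep])
print("full-set val acc:", clf.score(X_va, y_va))
print("subset   val acc:", subset_clf.score(X_va, y_va))
```

Because the subset is drawn stochastically from sampling probabilities (rather than each point receiving a deterministic weight), the scheme hedges against overconfidence in any single validation split, which is the motivation behind the worst-case-risk control described in the abstract.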