Subsampling algorithms are a natural approach to reducing data size before fitting models on massive datasets. In recent years, several works have proposed methods for subsampling rows from a data matrix while preserving the information relevant to classification. Although these works are supported by theory and limited experiments, to date there has been no comprehensive evaluation of these methods. In our work, we directly compare multiple methods for logistic regression drawn from the coreset and optimal subsampling literature and discover inconsistencies in their effectiveness. In many cases, these methods do not outperform simple uniform subsampling.
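The uniform-subsampling baseline referred to above can be sketched as follows. This is an illustrative example, not code from the paper: it draws rows uniformly at random from a synthetic data matrix and fits logistic regression on the subsample by plain gradient descent. All function names, hyperparameters, and the synthetic data are assumptions made for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=500):
    """Fit logistic regression weights (no intercept) by gradient descent."""
    w = np.zeros(X.shape[1])
    n = X.shape[0]
    for _ in range(steps):
        # Gradient of the average negative log-likelihood.
        grad = X.T @ (sigmoid(X @ w) - y) / n
        w -= lr * grad
    return w

def uniform_subsample(X, y, n_sub, rng):
    """Draw n_sub rows uniformly at random without replacement."""
    idx = rng.choice(X.shape[0], size=n_sub, replace=False)
    return X[idx], y[idx]

# Synthetic "massive" dataset generated from a known logistic model.
rng = np.random.default_rng(0)
N, d = 100_000, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = (rng.random(N) < sigmoid(X @ w_true)).astype(float)

# Fit on a uniform subsample of 1% of the rows.
X_sub, y_sub = uniform_subsample(X, y, n_sub=1_000, rng=rng)
w_hat = fit_logistic(X_sub, y_sub)
```

Informed methods (coreset constructions, optimal subsampling) replace the uniform draw in `uniform_subsample` with non-uniform row probabilities; the comparison in the paper asks whether that extra machinery beats this baseline.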