Running machine learning algorithms on large and rapidly growing volumes of data is often computationally expensive. One common trick to reduce the size of a data set, and thus the computational cost of machine learning algorithms, is \emph{probability sampling}: it creates a sampled data set by including each data point from the original data set with a known probability. Although the benefit of running machine learning algorithms on the reduced data set is obvious, one major concern is that the performance of the solution obtained from samples might be much worse than that of the optimal solution computed on the full data set. In this paper, we examine the performance loss caused by probability sampling in the context of adaptive submodular maximization. We consider a simple probability sampling method which selects each data point with probability at least $r\in[0,1]$. If we set $r=1$, our problem reduces to finding a solution based on the original full data set. We define the sampling gap as the largest ratio between the value of the optimal solution obtained from the full data set and that of the optimal solution obtained from the samples, over independence systems. Our main contribution is to show that if the sampling probability of each data point is at least $r$ and the utility function is policywise submodular, then the sampling gap is both upper bounded and lower bounded by $1/r$. We show that policywise submodularity arises in a wide range of real-world applications, including pool-based active learning and adaptive viral marketing.
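The sampling scheme described above can be sketched in a few lines. This is a minimal illustration of independent per-point inclusion with probabilities bounded below by $r$, not code from the paper; the function name and interface are ours:

```python
import random

def probability_sample(data, probs, r):
    """Independently keep each point x_i with its own inclusion probability p_i.

    Enforces the assumption p_i >= r for every point, matching the
    lower-bounded sampling probability used in the analysis.
    """
    if any(p < r for p in probs):
        raise ValueError("every inclusion probability must be at least r")
    return [x for x, p in zip(data, probs) if random.random() < p]

# With r = 1 every p_i must equal 1, so the full data set is returned and
# the problem reduces to optimizing over the original data.
full = probability_sample(list(range(10)), [1.0] * 10, r=1.0)
```

With $r$ strictly between 0 and 1, the returned sample is a random subset whose expected size is $\sum_i p_i$, which is where the computational savings come from.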