The blessing of ubiquitous data also comes with a curse: the communication, storage, and labeling of massive, mostly redundant datasets. We seek to solve this problem at its core, collecting only valuable data and discarding the rest via submodular maximization. Specifically, we develop algorithms for the online and distributed version of the problem, where data selection occurs in an uncoordinated fashion across multiple data streams. At the heart of our algorithms is a general and flexible core selection routine which, given any stream of data, any assessment of its value, and any formulation of its selection cost, extracts the most valuable subset of the stream up to a constant factor while using minimal memory. Notably, our methods achieve the same theoretical guarantees as their offline counterparts and, to the best of our knowledge, provide the first guarantees for online distributed submodular optimization in the literature. Finally, in learning tasks on ImageNet and MNIST, we show that our selection methods outperform random selection by $5-20\%$.
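The abstract does not spell out the core selection routine, but its description (a single pass over a stream, a constant-factor guarantee, minimal memory) matches threshold-based streaming submodular maximization. Below is a minimal Python sketch in the spirit of SIEVE-STREAMING (Badanidiyuru et al., 2014), not the paper's actual routine; the `coverage_value` objective and all other names are illustrative assumptions.

```python
# Sketch of threshold-based streaming submodular selection (SIEVE-STREAMING
# style). Assumption: a monotone submodular value function; here, a toy
# coverage objective over items that are sets of labels.
import math

def coverage_value(selected):
    """Toy monotone submodular objective: number of distinct labels covered."""
    covered = set()
    for labels in selected:
        covered |= labels
    return len(covered)

def sieve_streaming(stream, k, eps=0.2, f=coverage_value):
    """Select at most k items in one pass over `stream`, keeping one candidate
    set ("sieve") per guessed threshold of OPT; returns the best set found.
    Achieves a (1/2 - eps) approximation for monotone submodular f."""
    sieves = {}       # threshold value -> list of selected items
    max_single = 0.0  # largest singleton value seen so far
    for item in stream:
        max_single = max(max_single, f([item]))
        # Lazily maintain thresholds (1+eps)^i spanning [max_single, 2k*max_single].
        lo = math.ceil(math.log(max_single, 1 + eps)) if max_single > 0 else 0
        hi = math.floor(math.log(2 * k * max_single, 1 + eps)) if max_single > 0 else -1
        live = {(1 + eps) ** i for i in range(lo, hi + 1)}
        sieves = {v: s for v, s in sieves.items() if v in live}  # drop stale sieves
        for v in live:
            sieves.setdefault(v, [])  # new thresholds start with an empty set
        for v, s in sieves.items():
            if len(s) < k:
                gain = f(s + [item]) - f(s)
                # Keep the item if its marginal gain clears this sieve's bar.
                if gain >= (v / 2 - f(s)) / (k - len(s)):
                    s.append(item)
    return max(sieves.values(), key=f, default=[])

# Usage: each stream element is a set of labels; select up to 3 diverse items.
stream = [{1, 2}, {2, 3}, {1}, {4, 5, 6}, {5}, {7}]
print(sieve_streaming(stream, k=3))  # prints a near-optimal subset of <= 3 items
```

Each of the $O(\log(k)/\epsilon)$ sieves holds at most $k$ items, so memory stays small regardless of stream length, which is consistent with the minimal-memory claim in the abstract.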