The blessing of ubiquitous data also comes with a curse: the communication, storage, and labeling of massive, mostly redundant datasets. In our work, we seek to solve the problem at its source, collecting only valuable data and throwing out the rest, via active learning. We propose an online algorithm which, given any stream of data, any assessment of its value, and any formulation of its selection cost, extracts the most valuable subset of the stream up to a constant factor while using minimal memory. Notably, our analysis also holds for the federated setting, in which multiple agents select online from individual data streams without coordination and with potentially very different appraisals of cost. One particularly important use case is selecting and labeling training sets from unlabeled collections of data so as to maximize the test-time performance of a given classifier. In prediction tasks on ImageNet and MNIST, we show that our selection method outperforms random selection by 5-20%.
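The abstract does not spell out the algorithm itself, but a minimal sketch of the kind of single-pass, value-versus-cost selection it describes might look like the following. The function name `stream_select`, the 0.5 density threshold, and the `budget` and `memory` parameters are illustrative assumptions for this sketch, not the paper's actual method or guarantees.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")

def stream_select(
    stream: Iterable[T],
    value: Callable[[T], float],   # hypothetical per-item value appraisal
    cost: Callable[[T], float],    # hypothetical per-item selection cost
    budget: float,                 # total selection cost the agent will pay
    memory: int,                   # cap on items held at any one time
) -> List[T]:
    """Illustrative one-pass threshold selection (not the paper's algorithm)."""
    kept: List[Tuple[float, T]] = []   # (value/cost density, item)
    spent = 0.0
    best_density = 0.0

    for x in stream:
        c = cost(x)
        if c <= 0:
            continue
        density = value(x) / c
        best_density = max(best_density, density)
        # Keep items whose value-per-cost is competitive with the best seen so far
        # and that still fit in the remaining budget.
        if density >= 0.5 * best_density and spent + c <= budget:
            kept.append((density, x))
            spent += c
            # Enforce the memory cap by evicting the least dense item.
            if len(kept) > memory:
                kept.sort(key=lambda p: p[0])
                _, dropped = kept.pop(0)
                spent -= cost(dropped)
    return [x for _, x in kept]

# Example usage: pick high-"value" numbers from a stream at unit cost.
data = [3, 8, 1, 9, 2, 7, 5]
selected = stream_select(data, value=float, cost=lambda _: 1.0, budget=4, memory=4)
```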