Data summarization has become a valuable tool for understanding even terabytes of data. Due to their compelling theoretical properties, submodular functions have been the focus of summarization algorithms. These algorithms offer worst-case approximation guarantees at the expense of higher computation and memory requirements. However, many practical applications do not fall under this worst case and are usually much more well-behaved. In this paper, we propose a new submodular function maximization algorithm called ThreeSieves, which ignores the worst case but delivers a good solution with high probability. It selects the most informative items from a data stream on the fly and maintains a provable performance on a fixed memory budget. In an extensive evaluation, we compare our method against $6$ other methods on $8$ different datasets with and without concept drift. We show that our algorithm outperforms current state-of-the-art algorithms and, at the same time, uses fewer resources. Finally, we highlight a real-world use case of our algorithm for data summarization in gamma-ray astronomy. We make our code publicly available at https://github.com/sbuschjaeger/SubmodularStreamingMaximization.
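To make the idea of selecting items from a stream under a fixed memory budget concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy coverage objective, a generic sieve-style acceptance rule on the marginal gain, and a hypothetical `patience` parameter that lowers the threshold after several consecutive rejections. All function and parameter names are illustrative.

```python
from typing import Callable, Iterable, List, Sequence


def coverage(summary: Sequence[frozenset]) -> float:
    """Toy monotone submodular objective: number of distinct elements covered."""
    covered = set()
    for item in summary:
        covered |= item
    return float(len(covered))


def threshold_stream_summary(
    stream: Iterable[frozenset],
    f: Callable[[Sequence[frozenset]], float],
    k: int,
    thresholds: Sequence[float],
    patience: int,
) -> List[frozenset]:
    """Single pass over the stream, storing at most k items.

    Assumption: an item is accepted if its marginal gain is at least
    (v / 2 - f(S)) / (k - |S|), where v is the current threshold.
    Assumption: after `patience` consecutive rejections, v is lowered
    to the next smaller candidate threshold.
    """
    summary: List[frozenset] = []
    levels = sorted(thresholds, reverse=True)  # try the largest threshold first
    level = 0
    rejected = 0
    current = f(summary)
    for item in stream:
        if len(summary) >= k:
            break  # memory budget exhausted
        value = f(summary + [item])
        gain = value - current
        required = (levels[level] / 2.0 - current) / (k - len(summary))
        if gain >= required:
            summary.append(item)
            current = value
            rejected = 0
        else:
            rejected += 1
            if rejected >= patience and level < len(levels) - 1:
                level += 1  # give up on this threshold, try a smaller one
                rejected = 0
    return summary


if __name__ == "__main__":
    stream = [
        frozenset({1, 2}),
        frozenset({2}),
        frozenset({3, 4, 5}),
        frozenset({1}),
        frozenset({6}),
    ]
    picked = threshold_stream_summary(
        stream, coverage, k=2, thresholds=[2, 4, 8], patience=2
    )
    print(picked, coverage(picked))
```

The sketch keeps only the current summary and a rejection counter in memory, which is the fixed-budget property the abstract refers to. The candidate thresholds and the patience value are inputs here, and how to choose them is left open.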