Given a collection of $m$ subsets of a universe $\mathcal{U}$, the Maximum Set Coverage problem asks for $k$ sets whose union has the largest cardinality. The problem is NP-hard, but a polynomial-time algorithm approximates the optimum to within a factor of $1-1/e$. This algorithm, however, does not scale well with the input size. In the streaming setting, practical high-quality solutions exist, but their space complexity grows linearly with the size of the universe, $|\mathcal{U}|$. One randomized streaming algorithm, by contrast, has been shown to produce a $1-1/e-\varepsilon$ approximation of the optimal solution using space that grows only poly-logarithmically in $m$ and $|\mathcal{U}|$. To achieve such low space complexity, its authors used a technique called subsampling, based on hash functions with limited independence. This article focuses on this sublinear-space algorithm and introduces methods to reduce the time cost of subsampling. First, we show how to accelerate the algorithm by several orders of magnitude without altering the space complexity, number of passes, or approximation quality of the original algorithm. Second, we derive a new lower bound on the probability of producing a $1-1/e-\varepsilon$ approximation when using only pairwise independence: $1-\tfrac{4}{c k \log m}$, compared with $1-\tfrac{2e}{m^{ck/6}}$ for the original algorithm. Although the theoretical guarantees are weaker, our algorithm performs well in practice on large streams and presents the best time-space-performance trade-off for maximum coverage in streams.
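To make the two ingredients named in the abstract concrete, the following is a minimal offline sketch, not the streaming algorithm studied in the article. It assumes universe elements are non-negative integers smaller than the prime `P`, uses the textbook pairwise-independent family $h(x) = (ax + b) \bmod P$ to subsample the universe, and runs the classical greedy $1-1/e$ procedure on the result. The names `make_pairwise_hash`, `subsample`, `greedy_max_coverage`, and the parameter `rate` are illustrative choices, not taken from the article.

```python
import random

# Mersenne prime; assumed to exceed every element identifier in the universe.
P = (1 << 61) - 1


def make_pairwise_hash(rng):
    """Draw h(x) = (a*x + b) mod P from the standard pairwise-independent family."""
    a = rng.randrange(P)
    b = rng.randrange(P)
    return lambda x: (a * x + b) % P


def subsample(sets, h, rate):
    """Keep an element only if its hash falls below rate * P.

    The same hash is applied to every set, so an element is either kept
    everywhere or dropped everywhere (the universe itself is subsampled).
    """
    threshold = int(rate * P)
    return [{x for x in s if h(x) < threshold} for s in sets]


def greedy_max_coverage(sets, k):
    """Classical greedy: repeatedly pick the set with the largest marginal gain."""
    covered, chosen = set(), []
    for _ in range(min(k, len(sets))):
        best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # no remaining set adds a new element
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered


if __name__ == "__main__":
    sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1}]
    h = make_pairwise_hash(random.Random(0))
    # Select on the subsampled instance, then evaluate on the full one.
    chosen, _ = greedy_max_coverage(subsample(sets, h, 0.5), k=2)
    full_cover = set().union(*(sets[i] for i in chosen))
    print(chosen, len(full_cover))
```

In this sketch the sets are chosen on the smaller subsampled instance and their coverage is then measured on the original sets; how the sampling rate should be set to preserve the approximation guarantee is the subject of the analysis summarized above and is not reproduced here.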