MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or HyperMinHash, where it even performs better than the corresponding state-of-the-art estimators in many cases.
翻译:MinHash 和 HyperLogLog 是一种草图算法,对于大数据应用的设定摘要来说,这些算法是不可或缺的。 虽然超LogLog允许用很小的空间来计算不同的元素, 但 MinHash 适合对各组进行快速比较, 因为它可以估算“ 贾卡相似性” 和其他联合数量 。 这项工作提出了一个名为 SetSketch 的新数据结构, 能够持续填补两个使用案例之间的空白。 它的通和极能插入操作及其合并状态使得它适合分布式环境。 快速、 强大、 易于执行的基点和联合数量的估测器, 以及使用 Setsketch 进行相似性搜索的能力, 启用多功能应用程序 。 所提出的联合估计器也可以应用到其他数据结构, 如 MinHash、 超LogLog 或超MentMinHash 等, 在许多情况下, 其表现甚至优于相应的州艺术估计器 。