Traditional minwise hashing (MinHash) requires applying $K$ independent permutations to estimate the Jaccard similarity in massive binary (0/1) data, where $K$ can be 1024 or even larger, depending on the application. Recent work on C-MinHash (Li and Li, 2021) has shown, with rigorous proofs, that only two permutations are needed. An initial permutation is applied to break whatever structure might exist in the data, and a second permutation is re-used $K$ times, in a circulant shifting fashion, to produce $K$ hashes. Li and Li (2021) proved that, perhaps surprisingly, even though the $K$ hashes are correlated, the estimation variance is strictly smaller than that of traditional MinHash. They also demonstrated that the initial permutation in C-MinHash is indeed necessary and, for ease of theoretical analysis, used two independent permutations. In this paper, we show that a single permutation suffices: the same permutation is used both for the initial pre-processing step that breaks the structure in the data and for the circulant hashing step that generates the $K$ hashes. Although the theoretical analysis becomes very complicated, we are able to write down an explicit expression for the expectation of the estimator. The new estimator is no longer unbiased, but the bias is extremely small and has essentially no impact on the estimation accuracy (mean squared error). An extensive set of experiments is provided to verify our claim that just one permutation is enough.
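The one-permutation scheme described above can be illustrated with a minimal sketch. It is not the authors' reference implementation and rests on several assumptions: a binary vector is represented by the list of its nonzero indices, the initial step moves index $i$ to position $\pi(i)$, and the $k$-th hash uses $\pi$ shifted circularly to the right by $k$ positions (the shift direction and indexing conventions in the paper may differ). The function names `make_permutation`, `c_minhash_one_perm`, and `estimate_jaccard` are hypothetical.

```python
import random


def make_permutation(D, seed=0):
    """Return a random permutation of 0..D-1 as a list (seeded for repeatability)."""
    perm = list(range(D))
    random.Random(seed).shuffle(perm)
    return perm


def c_minhash_one_perm(nonzeros, perm, K):
    """Compute K hashes of a binary vector using a single permutation.

    nonzeros: indices of the nonzero entries of the D-dimensional vector.
    perm:     one permutation of 0..D-1, used for both steps (assumed convention).
    K:        number of hashes to produce (K <= D).
    """
    D = len(perm)
    # Step 1 (pre-processing): permute the data, moving index i to perm[i].
    moved = [perm[i] for i in nonzeros]
    hashes = []
    for k in range(1, K + 1):
        # Step 2 (circulant hashing): the permutation shifted right by k
        # takes the value perm[(i - k) % D] at position i; the hash is the
        # minimum of those values over the nonzero positions.
        hashes.append(min(perm[(i - k) % D] for i in moved))
    return hashes


def estimate_jaccard(h1, h2):
    """Fraction of hash positions on which the two sketches collide."""
    return sum(a == b for a, b in zip(h1, h2)) / len(h1)


if __name__ == "__main__":
    D, K = 1000, 256
    perm = make_permutation(D, seed=0)
    # Two deliberately structured sets with true Jaccard similarity 200/400 = 0.5.
    a = list(range(0, 300))
    b = list(range(100, 400))
    est = estimate_jaccard(c_minhash_one_perm(a, perm, K),
                           c_minhash_one_perm(b, perm, K))
    print(f"estimated Jaccard similarity: {est:.3f}")
```

Both vectors must be hashed with the same permutation and the same $K$ for the collision fraction to be meaningful. The sketch recomputes every shift directly for clarity rather than speed.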