C-OPH: Improving the Accuracy of One Permutation Hashing (OPH) with Circulant Permutations

Minwise hashing (MinHash) is a classical method for efficiently estimating the Jaccard similarity in massive binary (0/1) data. To generate $K$ hash values for each data vector, the standard theory of MinHash requires $K$ independent permutations. Interestingly, the recent work on "circulant MinHash" (C-MinHash) has shown that merely two permutations are needed. The first permutation breaks the structure of the data, and the second permutation is re-used $K$ times in a circulant manner. Surprisingly, the estimation variance of C-MinHash is proved to be strictly smaller than that of the original MinHash. More recent work further demonstrates that, in practice, only one permutation is needed. Note that C-MinHash is different from the well-known work on "One Permutation Hashing (OPH)" published in NIPS'12. OPH and its variants, which use different "densification" schemes, are popular alternatives to the standard MinHash. The densification step is necessary to deal with the empty bins that arise in One Permutation Hashing. In this paper, we propose to incorporate the essential ideas of C-MinHash to improve the accuracy of One Permutation Hashing. Specifically, we develop a new densification method for OPH that achieves a smaller estimation variance than all existing densification schemes for OPH. Our proposed method is named C-OPH (Circulant OPH). After the initial permutation (which breaks the existing structure of the data), C-OPH only needs a "shorter" permutation of length $D/K$ (instead of $D$), where $D$ is the original data dimension and $K$ is the total number of bins in OPH. This short permutation is re-used in the $K$ bins in a circulant shifting manner. It can be shown that the variance of the resulting Jaccard similarity estimator is strictly smaller than that of the existing (densified) OPH methods.
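
To make the two building blocks mentioned above concrete, the following is a minimal Python sketch, not the paper's implementation. It shows (a) plain OPH without densification, where empty bins are marked `None` and excluded from the estimator when both sketches are empty in that bin, and (b) a C-MinHash-style sketch in which one permutation is re-used $K$ times through circular shifts. The function names, the indexing convention for the shift, and the assumption that $K$ divides $D$ are illustrative choices made here; the C-OPH densification step itself is not reproduced.

```python
import random


def oph_sketch(nonzeros, perm, K):
    """Plain OPH (no densification). `nonzeros` holds the indices of the
    1-entries, `perm` is a permutation of range(D). Returns K values: the
    smallest within-bin offset of a permuted nonzero, or None if the bin is empty."""
    D = len(perm)
    if D % K != 0:
        raise ValueError("this sketch assumes K divides D")
    width = D // K  # each bin covers D/K consecutive permuted positions
    bins = [None] * K
    for i in nonzeros:
        b, offset = divmod(perm[i], width)
        if bins[b] is None or offset < bins[b]:
            bins[b] = offset
    return bins


def oph_estimate(s1, s2):
    """Jaccard estimate from two undensified OPH sketches:
    matches divided by the number of bins that are not jointly empty."""
    K = len(s1)
    n_emp = sum(1 for a, b in zip(s1, s2) if a is None and b is None)
    n_mat = sum(1 for a, b in zip(s1, s2) if a is not None and a == b)
    if n_emp == K:
        return 0.0  # both vectors are all-zero; returning 0 is an assumed convention
    return n_mat / (K - n_emp)


def circulant_shift(perm, k):
    """Return `perm` shifted k positions to the right, circularly."""
    k %= len(perm)
    return perm[-k:] + perm[:-k]


def c_minhash(nonzeros, sigma, pi, K):
    """C-MinHash-style sketch for a non-empty set of nonzeros: `sigma` is the
    initial permutation, and `pi` is re-used for k = 1..K via circular shifts."""
    hashes = []
    for k in range(1, K + 1):
        shifted = circulant_shift(pi, k)
        hashes.append(min(shifted[sigma[i]] for i in nonzeros))
    return hashes


if __name__ == "__main__":
    rng = random.Random(0)
    D, K = 64, 8
    x, y = set(range(0, 12)), set(range(4, 16))  # true Jaccard = 8/16 = 0.5

    perm = list(range(D))
    rng.shuffle(perm)
    print("OPH estimate:", oph_estimate(oph_sketch(x, perm, K), oph_sketch(y, perm, K)))

    sigma, pi = list(range(D)), list(range(D))
    rng.shuffle(sigma)
    rng.shuffle(pi)
    hx, hy = c_minhash(x, sigma, pi, K), c_minhash(y, sigma, pi, K)
    print("C-MinHash estimate:", sum(a == b for a, b in zip(hx, hy)) / K)
```

With sparse inputs such as the ones in the demo, some OPH bins are typically empty, which is the situation that densification schemes (including the one proposed in the paper) are designed to handle.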
