Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data. The basic theory of MinHash requires applying hundreds or even thousands of independent random permutations to each data vector in the dataset in order to obtain reliable results, for example, for building large-scale learning models or for approximate near neighbor search in massive data. In this paper, we propose {\bf Circulant MinHash (C-MinHash)} and provide the surprising theoretical result that only \textbf{two} independent random permutations are needed. In C-MinHash, we first apply an initial permutation to the data vector and then use a second permutation to generate the hash values; the second permutation is re-used $K$ times via circulant shifting to produce $K$ hashes. Unlike in classical MinHash, these $K$ hashes are correlated, yet we provide rigorous proofs that the estimate of the Jaccard similarity remains unbiased and that its theoretical variance is uniformly smaller than that of classical MinHash with $K$ independent permutations. The theoretical proofs for C-MinHash require non-trivial effort. Numerical experiments are conducted to validate the theory and demonstrate the effectiveness of C-MinHash.
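The following is a minimal, illustrative sketch (not from the paper) of the scheme just described, written in Python/NumPy. We assume that the $k$-th hash is the minimum of the $k$-th circulant shift of the second permutation $\pi$, evaluated over the nonzero coordinates of the vector permuted by $\sigma$; the function name \texttt{c\_minhash} and all variable names are our own.
\begin{verbatim}
import numpy as np

def c_minhash(x, K, seed=0):
    """Illustrative sketch of C-MinHash for a binary 0/1 vector x."""
    D = len(x)
    rng = np.random.default_rng(seed)
    sigma = rng.permutation(D)   # initial permutation applied to the data vector
    pi = rng.permutation(D)      # second permutation, re-used K times

    nz = sigma[np.flatnonzero(x)]   # nonzero coordinates after the initial permutation
    hashes = np.empty(K, dtype=np.int64)
    for k in range(K):
        # k-th circulant shift of pi, evaluated at the nonzero coordinates
        hashes[k] = np.min(pi[(nz + k) % D])
    return hashes

# The Jaccard similarity of two vectors hashed with the same seed can then be
# estimated by the fraction of positions where their hashes collide, e.g.,
#   np.mean(c_minhash(x, K=128) == c_minhash(y, K=128))
\end{verbatim}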