Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data. The basic theory of MinHash requires applying hundreds or even thousands of independent random permutations to each data vector in the dataset in order to obtain reliable results, for example, for building large-scale learning models or for approximate near neighbor search in massive data. In this paper, we propose {\bf Circulant MinHash (C-MinHash)} and provide the surprising theoretical result that only \textbf{two} independent random permutations are needed. In C-MinHash, we first apply an initial permutation to the data vector and then use a second permutation to generate the hash values; the second permutation is re-used $K$ times via circulant shifting to produce $K$ hashes. Unlike in classical MinHash, these $K$ hashes are correlated, yet we provide rigorous proofs that the estimate of the Jaccard similarity remains unbiased and that its theoretical variance is uniformly smaller than that of classical MinHash with $K$ independent permutations. The theoretical proofs for C-MinHash require non-trivial effort. Numerical experiments are conducted to validate the theory and demonstrate the effectiveness of C-MinHash.
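The following is a minimal, illustrative sketch (not from the paper) of the scheme just described, written in Python/NumPy. We assume that the $k$-th hash is the minimum of the $k$-th circulant shift of the second permutation $\pi$, evaluated over the nonzero coordinates of the vector permuted by $\sigma$; the function name \texttt{c\_minhash} and all variable names are our own.
\begin{verbatim}
import numpy as np

def c_minhash(x, K, seed=0):
    """Illustrative sketch of C-MinHash for a binary 0/1 vector x."""
    D = len(x)
    rng = np.random.default_rng(seed)
    sigma = rng.permutation(D)   # initial permutation applied to the data vector
    pi = rng.permutation(D)      # second permutation, re-used K times

    nz = sigma[np.flatnonzero(x)]   # nonzero coordinates after the initial permutation
    hashes = np.empty(K, dtype=np.int64)
    for k in range(K):
        # k-th circulant shift of pi, evaluated at the nonzero coordinates
        hashes[k] = np.min(pi[(nz + k) % D])
    return hashes

# The Jaccard similarity of two vectors hashed with the same seed can then be
# estimated by the fraction of positions where their hashes collide, e.g.,
#   np.mean(c_minhash(x, K=128) == c_minhash(y, K=128))
\end{verbatim}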