The rapid growth of vector management systems calls for practical similarity hash functions to support similarity analysis. In this paper, we systematically survey the existing well-known similarity hash functions to identify those that meet this need. We conclude that the similarity hash functions MinHash and Nilsimsa can be plugged directly into a similarity-analysis pipeline built on a vector management system. We then give a brief empirical discussion of the performance and drawbacks of these functions, and highlight that MinHash, the variant of SimHash and feature hashing are the best suited to vector management systems for large-scale similarity analysis.