Min-Hash is a popular technique for efficiently estimating the Jaccard similarity of binary sets. Consistent Weighted Sampling (CWS) generalizes the Min-Hash scheme to sketch weighted sets and has drawn increasing interest from the community. Due to its constant-time complexity independent of the values of the weights, Improved CWS (ICWS) is considered as the state-of-the-art CWS algorithm. In this paper, we revisit ICWS and analyze its underlying mechanism to show that there actually exists dependence between the two components of the hash-code produced by ICWS, which violates the condition of independence. To remedy the problem, we propose an Improved ICWS (I$^2$CWS) algorithm which not only shares the same theoretical computational complexity as ICWS but also abides by the required conditions of the CWS scheme. The experimental results on a number of synthetic data sets and real-world text data sets demonstrate that our I$^2$CWS algorithm can estimate the Jaccard similarity more accurately, and also compete with or outperform the compared methods, including ICWS, in classification and top-$K$ retrieval, after relieving the underlying dependence.
翻译:Min-Hash是高效估计二进制相近的圆形相近性的一种流行技术。 一致加权抽样(CWS)将Min-Hash计划笼统地概括为草图加权套件,并吸引了社区越来越多的兴趣。 由于不断的复杂程度与加权值无关,改进的CWS(ICWS)被视为最先进的CWS算法。 在本文中,我们重新审视ICWS并分析其基本机制,以表明ICWS产生的散装码的两个组成部分之间确实存在依赖性,这违反了独立的条件。 为了纠正问题,我们建议采用改进的ICWS(I$2$CWS)算法,该算法不仅与ICWS具有相同的理论复杂性,而且还符合CWS计划所要求的条件。 一些合成数据集和真实世界文本数据的实验结果显示,我们的I$2$CWS算法可以更准确地估计CACC的相似性,并且与比较方法竞争或超出,包括ICWS, IMS, 和顶级的SA-S。