重新审查改进一致的加权抽样 (Improved Consistent Weighted Sampling Revisited)

Min-Hash is a popular technique for efficiently estimating the Jaccard similarity of binary sets. Consistent Weighted Sampling (CWS) generalizes the Min-Hash scheme to sketch weighted sets and has drawn increasing interest from the community. Due to its constant-time complexity independent of the values of the weights, Improved CWS (ICWS) is considered as the state-of-the-art CWS algorithm. In this paper, we revisit ICWS and analyze its underlying mechanism to show that there actually exists dependence between the two components of the hash-code produced by ICWS, which violates the condition of independence. To remedy the problem, we propose an Improved ICWS (I$^2$CWS) algorithm which not only shares the same theoretical computational complexity as ICWS but also abides by the required conditions of the CWS scheme. The experimental results on a number of synthetic data sets and real-world text data sets demonstrate that our I$^2$CWS algorithm can estimate the Jaccard similarity more accurately, and also compete with or outperform the compared methods, including ICWS, in classification and top-$K$ retrieval, after relieving the underlying dependence.

翻译：Min-Hash是高效估计二进制相近的圆形相近性的一种流行技术。一致加权抽样(CWS)将Min-Hash计划笼统地概括为草图加权套件,并吸引了社区越来越多的兴趣。由于不断的复杂程度与加权值无关,改进的CWS(ICWS)被视为最先进的CWS算法。在本文中,我们重新审视ICWS并分析其基本机制,以表明ICWS产生的散装码的两个组成部分之间确实存在依赖性,这违反了独立的条件。为了纠正问题,我们建议采用改进的ICWS(I$2$CWS)算法,该算法不仅与ICWS具有相同的理论复杂性,而且还符合CWS计划所要求的条件。一些合成数据集和真实世界文本数据的实验结果显示,我们的I$2$CWS算法可以更准确地估计CACC的相似性,并且与比较方法竞争或超出,包括ICWS, IMS, 和顶级的SA-S。

相关内容

ICWS

关注 0

国际Web服务会议（ICWS）是研究人员和行业从业人员交流基于Web服务的最新技术和实践进展、确定新的研究主题和定义基于Web服务的未来的主要国际论坛。所有关于基于Web的服务生命周期研究和管理的主题都与ICWS的主题保持一致。2019年，该会议将努力推进规模最大的互联网/网络服务国际专业论坛。官网链接：https://conferences.computer.org/icws/2019/

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日