The Min-Hashing approach to sketching has become an important tool in data analysis, information retrieval, and classification. For real-valued datasets, the ICWS algorithm is the seminal approach: it is widely used and provides state-of-the-art performance for this problem space. However, ICWS becomes computationally expensive as the sketch size K increases. We develop a new simplified approach to the ICWS algorithm that yields speedups of over 20x compared to the standard algorithm. We demonstrate the validity of our approach empirically on multiple datasets and scenarios, showing that our new Simplified CWS obtains the same quality of results while being an order of magnitude faster.
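For context, the per-sample work that makes the sketch cost grow with K can be illustrated with a minimal sketch of the standard ICWS (Improved Consistent Weighted Sampling) step in its commonly described form. This is not the Simplified CWS proposed here, whose details the abstract does not give. The function names (`icws_sketch`, `estimate_similarity`, `weighted_jaccard`), the dense-vector input, and the shared `seed` are assumptions made for illustration only.

```python
import numpy as np


def icws_sketch(weights, num_hashes, seed=0):
    """Illustrative ICWS sketch of a dense, non-negative weight vector.

    Returns two arrays of length num_hashes: the sampled index k* and the
    integer t_{k*} for each of the K = num_hashes samples.
    """
    weights = np.asarray(weights, dtype=float)
    dim = weights.shape[0]

    # Assumption: vectors being compared share `seed` and `dim`, so the
    # random variables for (sample i, feature k) are identical across them.
    rng = np.random.default_rng(seed)
    r = rng.gamma(2.0, 1.0, size=(num_hashes, dim))
    c = rng.gamma(2.0, 1.0, size=(num_hashes, dim))
    beta = rng.uniform(0.0, 1.0, size=(num_hashes, dim))

    active = np.flatnonzero(weights > 0)  # only non-zero features can be sampled
    if active.size == 0:
        raise ValueError("weights must contain at least one positive entry")

    r_a, c_a, beta_a = r[:, active], c[:, active], beta[:, active]
    log_w = np.log(weights[active])

    # t_k = floor(ln(S_k) / r_k + beta_k)
    t = np.floor(log_w / r_a + beta_a)
    # ln(y_k) = r_k * (t_k - beta_k);  ln(a_k) = ln(c_k) - ln(y_k) - r_k
    ln_a = np.log(c_a) - r_a * (t - beta_a) - r_a

    # One argmin over all active features per sample: O(K * nnz) work overall.
    best = np.argmin(ln_a, axis=1)
    rows = np.arange(num_hashes)
    return active[best], t[rows, best].astype(np.int64)


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of samples on which both (k*, t) components agree."""
    k_a, t_a = sketch_a
    k_b, t_b = sketch_b
    return float(np.mean((k_a == k_b) & (t_a == t_b)))


def weighted_jaccard(x, y):
    """Exact weighted Jaccard similarity, for comparison with the estimate."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.minimum(x, y).sum() / np.maximum(x, y).sum())


if __name__ == "__main__":
    x = [1.0, 2.0, 0.0, 3.0]
    y = [2.0, 1.0, 1.0, 3.0]
    estimate = estimate_similarity(icws_sketch(x, 2000), icws_sketch(y, 2000))
    print("exact:", weighted_jaccard(x, y), "estimate:", estimate)
```

Each of the K samples needs its own random variables and its own minimum over every non-zero feature, which is the cost that grows with the sketch size. Two identical vectors agree on every sample, and vectors with disjoint supports agree on none.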