We present Tera-SLASH, an MPI (Message Passing Interface) based distributed system for approximate similarity search over tera-scale datasets. SLASH provides a multi-node implementation of the popular LSH (locality-sensitive hashing) algorithm, which is generally implemented on a single machine. We offer a novel design and sketching solution that exponentially reduces inter-machine communication overhead. In a direct comparison on comparable hardware, SLASH is more than 10,000x faster than the popular LSH package in PySpark, a widely adopted distributed implementation of the LSH algorithm for large datasets that is deployed in commercial platforms. Finally, we show how our system scales to the tera-scale Criteo dataset, which has more than 4 billion samples. SLASH can index this 2.3-terabyte dataset over 20 nodes (on a shared cluster at Rice) in under an hour, with query times of a fraction of a millisecond. To the best of our knowledge, there is no open-source system that can index and perform similarity search on Criteo with a commodity cluster.