Tera-SLASH:一个基于LSH LSH系统,用于Tera-SLASH的Tera-SLASH搜索 (Tera-SLASH: A Distributed Energy-Efficient MPI based LSH System for Tera-Scale Similarity Search) - 专知论文

会员服务 ·

0

LSH · Criteo · PySpark · 相似度 · 簇 ·

2020 年 8 月 5 日

Tera-SLASH: A Distributed Energy-Efficient MPI based LSH System for Tera-Scale Similarity Search

翻译：Tera-SLASH:一个基于LSH LSH系统,用于Tera-SLASH的Tera-SLASH搜索

Nicholas Meisburger,Anshumali Shrivastava

We present Tera-SLASH, an MPI (Message Passing Interface) based distributed system for approximate similarity search over tera-scale datasets. SLASH provides a multi-node implementation of the popular LSH (locality sensitive hashing) algorithm, which is generally implemented on a single machine. We offer a novel design and sketching solution to reduce the inter-machine communication overheads exponentially. In a direct comparison on comparable hardware, SLASH is more than 10000x faster than the popular LSH package in PySpark. PySpark is a widely-adopted distributed implementation of the LSH algorithm for large datasets and is deployed in commercial platforms. In the end, we show how our system scale to Tera-scale Criteo dataset with more than 4 billion samples. SLASH can index this 2.3 terabyte data over 20 nodes (on a shared cluster at Rice) in under an hour, with a query time in a fraction of milliseconds. To the best of our knowledge, there is no open-source system that can index and perform a similarity search on Criteo with a commodity cluster.

翻译：我们展示了以Tera-SLASH(Message Passing-SLASH)为基础的分布式系统,用于对梯度数据集进行近似相似的搜索。 SLASH提供流行的 LSH(地方敏感散射)算法的多节执行,通常在一台机器上实施。我们提供了一个新颖的设计和草图解决方案,以指数化减少机器间通信间接费用。在对可比硬件的直接比较中,SLASH比在PySpark的广受欢迎的 LSH软件包快10 000倍以上。PySpark是广泛采用大型数据集LSH算法的分布式实施,并部署在商业平台上。最后,我们展示了我们如何将系统规模与40亿多个样本的Tera-Criteo数据集相匹配。SLASHSASH可以在一个小时之内将2.3 兆字节的数据(在大米的一个共享的集束上)指数化为20个节点,同时以毫秒的速度进行查询。据我们所知,没有开放源系统能够对Criteo的商品集群进行索引和类似搜索。

0

相关内容

LSH

局部敏感哈希算法

【SIGIR2020-NUS】解缠图协同过滤，Disentangled Graph Collaborative Filtering

【SIGIR2020-NUS】解缠图协同过滤，Disentangled Graph Collaborative Filtering

专知会员服务

60+阅读 · 2020年7月6日

Python分布式计算，171页pdf，Distributed Computing with Python

Python分布式计算，171页pdf，Distributed Computing with Python

专知会员服务

108+阅读 · 2020年5月3日

【百度】-大规模深度学习广告系统的分布式分层GPU参数服务器，Distributed Hierarchical GPU PS

专知会员服务

24+阅读 · 2020年3月15日

【O'Reilly AI Conference 2019】部署大规模分布式数据（How to deploy large-scale distributed data analytics and machine learning on containers (sponsored by HPE))，HPE BlueData，Thomas Phelan

【O'Reilly AI Conference 2019】部署大规模分布式数据（How to deploy large-scale distributed data analytics and machine learning on containers (sponsored by HPE))，HPE BlueData，Thomas Phelan

专知会员服务

19+阅读 · 2019年11月5日

【CIKM2019 Tutorial】Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join(字符串相似性搜索与连接：数据库技术与机器学习模型的协同)，附论文免费下载

【CIKM2019 Tutorial】Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join(字符串相似性搜索与连接：数据库技术与机器学习模型的协同)，附论文免费下载

专知会员服务

10+阅读 · 2019年11月3日

【推荐系统/计算广告/机器学习/CTR预估资料汇总】

【推荐系统/计算广告/机器学习/CTR预估资料汇总】

专知会员服务

88+阅读 · 2019年10月21日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

机器学习相关资源(框架、库、软件)大列表

机器学习相关资源(框架、库、软件)大列表

专知会员服务

40+阅读 · 2019年10月9日

局部学习的特征选择：Local-Learning-Based Feature Selection

局部学习的特征选择：Local-Learning-Based Feature Selection

我爱读PAMI

14+阅读 · 2019年9月20日

LibRec 精选：AutoML for Contextual Bandits

LibRec 精选：AutoML for Contextual Bandits

LibRec智能推荐

7+阅读 · 2019年9月19日

分布式并行架构Ray介绍

分布式并行架构Ray介绍

CreateAMind

10+阅读 · 2019年8月9日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

神器Cobalt Strike3.13破解版

神器Cobalt Strike3.13破解版

黑白之道

12+阅读 · 2019年3月1日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

meta learning 17年：MAML SNAIL

meta learning 17年：MAML SNAIL

CreateAMind

11+阅读 · 2019年1月2日

Ray RLlib: Scalable 降龙十八掌

Ray RLlib: Scalable 降龙十八掌

CreateAMind

9+阅读 · 2018年12月28日

Hierarchical Imitation - Reinforcement Learning

Hierarchical Imitation - Reinforcement Learning

CreateAMind

19+阅读 · 2018年5月25日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

Distributed Graph Convolutional Networks

Arxiv

19+阅读 · 2020年7月13日

Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems

Arxiv

7+阅读 · 2020年3月12日

TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank

Arxiv

5+阅读 · 2019年5月17日

One-Shot Federated Learning

One-Shot Federated Learning

Arxiv

9+阅读 · 2019年3月5日

GPU-Accelerated Robotic Simulation for Distributed Reinforcement Learning

GPU-Accelerated Robotic Simulation for Distributed Reinforcement Learning

Arxiv

4+阅读 · 2018年10月24日

Learning Graph Embeddings from WordNet-based Similarity Measures

Learning Graph Embeddings from WordNet-based Similarity Measures

Arxiv

4+阅读 · 2018年8月16日

Characterizing Departures from Linearity in Word Translation

Arxiv

3+阅读 · 2018年6月7日

Efficient end-to-end learning for quantizable representations

Arxiv

6+阅读 · 2018年5月15日

Scalable attribute-aware network embedding with localily

Arxiv

3+阅读 · 2018年4月17日

Large Scale Local Online Similarity/Distance Learning Framework based on Passive/Aggressive

Arxiv

5+阅读 · 2018年4月5日

VIP会员

文章信息

相关主题

相关VIP内容

【SIGIR2020-NUS】解缠图协同过滤，Disentangled Graph Collaborative Filtering

【SIGIR2020-NUS】解缠图协同过滤，Disentangled Graph Collaborative Filtering

专知会员服务

60+阅读 · 2020年7月6日

Python分布式计算，171页pdf，Distributed Computing with Python

Python分布式计算，171页pdf，Distributed Computing with Python

专知会员服务

108+阅读 · 2020年5月3日

【百度】-大规模深度学习广告系统的分布式分层GPU参数服务器，Distributed Hierarchical GPU PS

专知会员服务

24+阅读 · 2020年3月15日

【O'Reilly AI Conference 2019】部署大规模分布式数据（How to deploy large-scale distributed data analytics and machine learning on containers (sponsored by HPE))，HPE BlueData，Thomas Phelan

【O'Reilly AI Conference 2019】部署大规模分布式数据（How to deploy large-scale distributed data analytics and machine learning on containers (sponsored by HPE))，HPE BlueData，Thomas Phelan

专知会员服务

19+阅读 · 2019年11月5日

【CIKM2019 Tutorial】Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join(字符串相似性搜索与连接：数据库技术与机器学习模型的协同)，附论文免费下载

【CIKM2019 Tutorial】Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join(字符串相似性搜索与连接：数据库技术与机器学习模型的协同)，附论文免费下载

专知会员服务

10+阅读 · 2019年11月3日

【推荐系统/计算广告/机器学习/CTR预估资料汇总】

【推荐系统/计算广告/机器学习/CTR预估资料汇总】

专知会员服务

88+阅读 · 2019年10月21日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

机器学习相关资源(框架、库、软件)大列表

机器学习相关资源(框架、库、软件)大列表

专知会员服务

40+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

《俄乌战争中的无人系统：新的战争方式与新兴趋势——来自前线的印象》报告

《海上自主水面船舶远程操作中心：安全可持续运行的多维度分析》

多模态大语言模型下游调优中“保持自我”的重要性

隐身自主无人水下航行器技术如何变革水下作战并重塑海军竞争

相关资讯

局部学习的特征选择：Local-Learning-Based Feature Selection

局部学习的特征选择：Local-Learning-Based Feature Selection

我爱读PAMI

14+阅读 · 2019年9月20日

LibRec 精选：AutoML for Contextual Bandits

LibRec 精选：AutoML for Contextual Bandits

LibRec智能推荐

7+阅读 · 2019年9月19日

分布式并行架构Ray介绍

分布式并行架构Ray介绍

CreateAMind

10+阅读 · 2019年8月9日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

神器Cobalt Strike3.13破解版

神器Cobalt Strike3.13破解版

黑白之道

12+阅读 · 2019年3月1日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

meta learning 17年：MAML SNAIL

meta learning 17年：MAML SNAIL

CreateAMind

11+阅读 · 2019年1月2日

Ray RLlib: Scalable 降龙十八掌

Ray RLlib: Scalable 降龙十八掌

CreateAMind

9+阅读 · 2018年12月28日

Hierarchical Imitation - Reinforcement Learning

Hierarchical Imitation - Reinforcement Learning

CreateAMind

19+阅读 · 2018年5月25日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

相关论文

Distributed Graph Convolutional Networks

Arxiv

19+阅读 · 2020年7月13日

Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems

Arxiv

7+阅读 · 2020年3月12日

TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank

Arxiv

5+阅读 · 2019年5月17日

One-Shot Federated Learning

One-Shot Federated Learning

Arxiv

9+阅读 · 2019年3月5日

GPU-Accelerated Robotic Simulation for Distributed Reinforcement Learning

GPU-Accelerated Robotic Simulation for Distributed Reinforcement Learning

Arxiv

4+阅读 · 2018年10月24日

Learning Graph Embeddings from WordNet-based Similarity Measures

Learning Graph Embeddings from WordNet-based Similarity Measures

Arxiv

4+阅读 · 2018年8月16日

Characterizing Departures from Linearity in Word Translation

Arxiv

3+阅读 · 2018年6月7日

Efficient end-to-end learning for quantizable representations

Arxiv

6+阅读 · 2018年5月15日

Scalable attribute-aware network embedding with localily

Arxiv

3+阅读 · 2018年4月17日

Large Scale Local Online Similarity/Distance Learning Framework based on Passive/Aggressive

Arxiv

5+阅读 · 2018年4月5日

微信扫码咨询专知VIP会员