LIDER: An Efficient High-dimensional Learned Index for Large-scale Dense Passage Retrieval

Text retrieval using dense embeddings generated by deep neural models is called "dense passage retrieval". Dense passage retrieval systems normally deploy a deep neural model followed by an approximate nearest neighbor (ANN) search module. The model generates text embeddings, which are then indexed by the ANN module. As the data scale grows, the ANN module unavoidably becomes the efficiency bottleneck, because its time complexity is linear or sublinear in the data scale. An alternative is the learned index, which has a theoretically constant time complexity. However, most existing learned indexes are designed for low-dimensional data, so they are not suitable for dense passage retrieval tasks with high-dimensional dense embeddings. We propose LIDER, an efficient high-dimensional Learned Index for large-scale DEnse passage Retrieval. LIDER has a clustering-based hierarchical architecture formed by two layers of core models. As the basic unit that LIDER uses to index and search data, each core model includes an adapted recursive model index (RMI) and a dimension reduction component, which consists of an extended SortingKeys-LSH (SK-LSH) and a key re-scaling module. The dimension reduction component reduces the high-dimensional dense embeddings to one-dimensional keys and sorts them in a specific order; the sorted keys are then used by the RMI. The RMI consists of multiple simple linear regression models that make fast predictions in only O(1) time. We optimize and combine SK-LSH and RMI into the core model, and organize multiple core models into a two-layer structure based on a clustering-based partitioning of the whole data space. Experiments show that LIDER achieves a higher search speed with high retrieval quality compared with the state-of-the-art ANN indexes commonly used in dense passage retrieval. Furthermore, LIDER offers a better speed-quality trade-off.
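The core-model idea in the abstract (reduce each embedding to a one-dimensional key, sort the keys, and let a linear model predict a position in the sorted order) can be illustrated with a minimal Python sketch. This is not LIDER's implementation. It assumes a single random projection as a stand-in for the extended SK-LSH and key re-scaling, one least-squares line in place of a multi-model RMI, and no clustering layer. The names `LinearLearnedIndex`, `predict_position`, `search`, and the `window` parameter are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class LinearLearnedIndex:
    """Toy learned index: 1-D keys plus one linear model over sorted positions.

    Assumption: a single random projection replaces SK-LSH, and a single
    least-squares line replaces the RMI described in the abstract.
    """

    def __init__(self, vectors, dim, seed=0):
        rng = random.Random(seed)
        self.vectors = vectors
        # Random direction used to map each vector to a scalar key.
        self.direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]

        # Sort vector ids by their 1-D key.
        keyed = sorted((self.key(v), i) for i, v in enumerate(vectors))
        self.keys = [k for k, _ in keyed]
        self.ids = [i for _, i in keyed]

        # Fit position ~ slope * key + intercept by ordinary least squares.
        n = len(self.keys)
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (p - mean_pos)
                  for p, k in enumerate(self.keys))
        self.slope = cov / var if var > 0 else 0.0
        self.intercept = mean_pos - self.slope * mean_key

    def key(self, vec):
        """Project a vector onto the random direction to get its scalar key."""
        return sum(v * d for v, d in zip(vec, self.direction))

    def predict_position(self, vec):
        """Predict the sorted position of a vector: one multiply and one add
        once the key is known, independent of the number of indexed items."""
        pos = int(round(self.slope * self.key(vec) + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def search(self, query, k=1, window=8):
        """Rank only the candidates within `window` slots of the prediction."""
        center = self.predict_position(query)
        lo = max(center - window, 0)
        hi = min(center + window + 1, len(self.keys))
        candidates = self.ids[lo:hi]
        candidates.sort(key=lambda i: sq_dist(self.vectors[i], query))
        return candidates[:k]


if __name__ == "__main__":
    rng = random.Random(42)
    data = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
    index = LinearLearnedIndex(data, dim=16)
    print(index.search(data[10], k=3, window=20))
```

In this sketch, `window` is the only tuning knob: a wider window examines more candidates and is slower but less likely to miss true neighbors, which loosely mirrors the speed-quality trade-off the abstract mentions. A single projection loses far more neighborhood structure than a real LSH scheme would, so the retrieval quality of this toy says nothing about LIDER's.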

