Approximate nearest neighbor search (ANNS) is a fundamental building block in information retrieval with graph-based indices being the current state-of-the-art and widely used in the industry. Recent advances in graph-based indices have made it possible to index and search billion-point datasets with high recall and millisecond-level latency on a single commodity machine with an SSD. However, existing graph algorithms for ANNS support only static indices that cannot reflect real-time changes to the corpus required by many key real-world scenarios (e.g. index of sentences in documents, email, or a news index). To overcome this drawback, the current industry practice for manifesting updates into such indices is to periodically re-build these indices, which can be prohibitively expensive. In this paper, we present the first graph-based ANNS index that reflects corpus updates into the index in real-time without compromising on search performance. Using update rules for this index, we design FreshDiskANN, a system that can index over a billion points on a workstation with an SSD and limited memory, and support thousands of concurrent real-time inserts, deletes and searches per second each, while retaining $>95\%$ 5-recall@5. This represents a 5-10x reduction in the cost of maintaining freshness in indices when compared to existing methods.
翻译:近距离近邻搜索(ANNS)是信息检索的基本基石,以图表为基础的指数进行信息检索,以图表为基础的指数是当前最先进和该行业广泛使用的指数,最近基于图表的指数进步使得有可能在带有SSD的单一商品机器上用高回调和毫秒的悬浮度来索引和搜索10亿点数据集。然而,现有的ANNS图表算法仅支持不能反映许多关键现实世界情景(例如文件、电子邮件或新闻索引中的句号索引)所要求的对基本内容实时变化的静态指数。为了克服这一缺陷,目前显示这些指数更新的行业做法是定期重新建立这些指数,这些指数可能费用过高。在本文件中,我们提供了第一个反映实时指数中基本更新的基于图表的ANNS指数,但不影响搜索性能。我们设计了这个指数的更新规则,我们设计了“新鲜DiskANNN,这个系统可以在一个带有SSD和有限记忆的工作站上将超过10亿点的指数指数指数指数指数加到数千美元,并且支持同时实时更新的指数,在5美元/10之间将每个新的指数插入,同时保存到5美元。