State-of-the-art algorithms for Approximate Nearest Neighbor Search (ANNS), such as DiskANN, FAISS-IVF, and HNSW, build data-dependent indices that offer substantially better accuracy and search efficiency than data-agnostic indices by overfitting to the distribution of the index data. When the query data is drawn from a different distribution - e.g., when the index represents image embeddings and the queries represent textual embeddings - such algorithms lose much of this performance advantage. On a variety of datasets, for a fixed recall target, latency is worse by an order of magnitude or more for Out-Of-Distribution (OOD) queries than for In-Distribution (ID) queries. The question we address in this work is whether ANNS algorithms can be made efficient for OOD queries if index construction is given access to a small sample of these queries. We answer positively by presenting OOD-DiskANN, which uses a small sample of OOD queries (1% of the index set size) and provides up to 40% improvement in mean query latency over state-of-the-art algorithms of a similar memory footprint. OOD-DiskANN is scalable and has the efficiency of graph-based ANNS indices. Some of our contributions can improve query efficiency for ID queries as well.