Vector similarity search has become a critical component of AI-driven applications such as large language models (LLMs). To achieve high recall and low latency, GPUs are used to exploit massive parallelism for faster query processing. However, as the number of vectors grows, the graph index quickly exceeds the memory capacity of a single GPU, making it infeasible to store and process the entire index on one device. Recent work adopts CPU-GPU architectures that keep vectors in CPU memory or on SSDs, but the loading step stalls GPU computation. We present Fantasy, an efficient system that pipelines vector search and data transfer across a GPU cluster using GPUDirect Async. By overlapping computation with network communication, Fantasy significantly improves search throughput on large graphs and supports large query batch sizes.
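The overlap of computation and data transfer described above can be illustrated with a minimal double-buffering sketch. This is not Fantasy's implementation: `fetch_partition` and `search_partition` are hypothetical stand-ins for a GPUDirect transfer and a GPU search kernel, and the thread pool plays the role of an asynchronous transfer engine that prefetches the next graph partition while the current one is being searched.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for a GPUDirect Async transfer and a GPU search kernel.
def fetch_partition(i):
    """Simulate loading graph partition i from a remote node."""
    return list(range(i * 4, i * 4 + 4))

def search_partition(data):
    """Simulate a distance computation over one partition."""
    return sum(data)

def pipelined_search(num_partitions):
    """Overlap the transfer of partition i+1 with the search of partition i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch_partition, 0)      # prefetch first partition
        for i in range(num_partitions):
            data = pending.result()                  # wait for current partition
            if i + 1 < num_partitions:
                pending = io.submit(fetch_partition, i + 1)  # start next transfer
            results.append(search_partition(data))   # compute while transfer runs
    return results

print(pipelined_search(3))  # [6, 22, 38]
```

With real GPU kernels and network transfers, the same double-buffered loop hides transfer latency behind computation, which is the source of the throughput gains claimed above.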