Retrieval Augmented Generation (RAG) uses vector databases to expand the expertise of an LLM model without having to retrain it. This idea can be applied over data lakes, leading to the notion of embeddings data lakes, i.e., a pool of vector databases ready to be used by RAGs. The key component in these systems is the indexes enabling Approximated Nearest Neighbor Search (ANNS). However, in data lakes, one cannot realistically expect to build indexes for every possible dataset. In this paper, we propose an adaptive, partition-based index, CrackIVF, that performs much better than up-front index building. CrackIVF starts answering queries by near brute force search and only expands as it sees enough queries. It does so by progressively adapting the index to the query workload. That way, queries can be answered right away without having to build a full index first. After seeing enough queries, CrackIVF will produce an index comparable to the best of those built using conventional techniques. As the experimental evaluation shows, CrackIVF can often answer more than 1 million queries before other approaches have even built the index and can start answering queries immediately, achieving 10-1000x faster initialization times. This makes it ideal when working with cold data or infrequently used data or as a way to bootstrap access to unseen datasets.
翻译:暂无翻译