With the development of learning-based embedding models, embedding vectors are widely used for analyzing and searching unstructured data. As vector collections grow beyond the billion scale, fully managed and horizontally scalable vector databases become necessary. Over the past three years, through interaction with our 1200+ industry users, we have sketched a vision of the features that next-generation vector databases should have: long-term evolvability, tunable consistency, good elasticity, and high performance. We present Manu, a cloud-native vector database that implements these features. Integrating all of them is difficult under traditional DBMS design rules. Because most vector data applications do not require complex data models or strong data consistency, our design philosophy is to relax the data model and consistency constraints in exchange for the aforementioned features. Specifically, Manu first exposes the write-ahead log (WAL) and binlog as backbone services. Second, write components are designed as log publishers, while all read-only analytic and search components are designed as independent subscribers to the log services. Finally, we use multi-version concurrency control (MVCC) and a delta consistency model to simplify communication and cooperation among the system components. These designs achieve low coupling among the system components, which is essential for elasticity and evolution. We also extensively optimize Manu for performance and usability with hardware-aware implementations and support for complex search semantics.
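The log-as-backbone idea in the abstract can be illustrated with a minimal sketch. It is not Manu's implementation: the names `Log`, `SearchNode`, `poll`, and `can_serve` are hypothetical, the log is an in-memory list rather than a distributed service, and staleness is measured in log entries, which is an assumption made here for simplicity. The sketch shows a writer publishing to an append-only log, a read-only search node subscribing at its own pace, and a delta check that lets each query decide how much lag it tolerates.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    ts: int              # logical timestamp assigned by the log (assumption)
    key: str
    vector: list[float]


class Log:
    """Append-only log standing in for the WAL backbone (hypothetical)."""

    def __init__(self) -> None:
        self.entries: list[LogEntry] = []

    def publish(self, key: str, vector: list[float]) -> int:
        """Writers only append; they never talk to readers directly."""
        ts = len(self.entries) + 1
        self.entries.append(LogEntry(ts, key, vector))
        return ts

    @property
    def latest_ts(self) -> int:
        return len(self.entries)


class SearchNode:
    """Read-only subscriber that replays the log independently."""

    def __init__(self, log: Log) -> None:
        self.log = log
        self.applied_ts = 0                      # last log entry applied locally
        self.data: dict[str, list[float]] = {}

    def poll(self, max_entries: int) -> None:
        """Apply up to max_entries not-yet-seen log entries."""
        start = self.applied_ts
        for entry in self.log.entries[start:start + max_entries]:
            self.data[entry.key] = entry.vector
            self.applied_ts = entry.ts

    def can_serve(self, delta: int) -> bool:
        """Delta check: the node may lag the log by at most `delta` entries."""
        return self.log.latest_ts - self.applied_ts <= delta

    def search(self, query: list[float], delta: int):
        """Brute-force nearest neighbor, refused if the node is too stale."""
        if not self.can_serve(delta):
            raise RuntimeError("subscriber is staler than the requested delta")
        if not self.data:
            return None
        return min(
            self.data,
            key=lambda k: sum((a - b) ** 2 for a, b in zip(self.data[k], query)),
        )


log = Log()
node = SearchNode(log)
for key, vec in [("a", [0.0, 0.0]), ("b", [1.0, 1.0]), ("c", [5.0, 5.0])]:
    log.publish(key, vec)

node.poll(max_entries=2)                    # the node has seen "a" and "b" only
nearest_stale = node.search([0.9, 0.9], delta=1)
node.poll(max_entries=10)                   # the node catches up
nearest_fresh = node.search([4.0, 4.0], delta=0)
```

Because the writer and the search node share only the log, either side could be scaled or replaced without changing the other, which is the low coupling the abstract refers to. Setting `delta=0` asks for a fully caught-up reader, while a larger `delta` trades freshness for availability.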