Multi-vector retrieval methods combine the merits of sparse (e.g. BM25) and dense (e.g. DPR) retrievers and have achieved state-of-the-art performance on various retrieval tasks. These methods, however, are orders of magnitude slower than their single-vector counterparts and need much more space to store their indices. In this paper, we unify different multi-vector retrieval models from a token routing viewpoint and propose conditional token interaction via dynamic lexical routing, namely CITADEL, for efficient and effective multi-vector retrieval. CITADEL learns to route each token vector to predicted lexical "keys" such that a query token vector only interacts with document token vectors routed to the same key. This design significantly reduces the computation cost while maintaining high accuracy. Notably, CITADEL achieves the same or slightly better performance than the previous state of the art, ColBERT-v2, on both in-domain (MS MARCO) and out-of-domain (BEIR) evaluations, while being nearly 40 times faster. Code and data are available at https://github.com/facebookresearch/dpr-scale.
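To make the idea of conditional token interaction concrete, the following is a minimal sketch and not the CITADEL implementation. It assumes each token has already been assigned a single lexical key (in CITADEL the routing is learned; here the keys are hard-coded by hand), and it borrows a max-then-sum aggregation in the style of late-interaction models as an illustrative assumption. The names `build_index`, `search`, and the toy vectors are hypothetical.

```python
from collections import defaultdict


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_index(docs):
    """Group document token vectors by their routed key.

    docs: {doc_id: [(key, vector), ...]}
    returns: {key: [(doc_id, vector), ...]}  (an inverted index over keys)
    """
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for key, vec in tokens:
            index[key].append((doc_id, vec))
    return index


def search(index, query_tokens):
    """Score documents, letting a query token interact only with
    document tokens that were routed to the same key."""
    scores = defaultdict(float)
    for key, q_vec in query_tokens:
        # Best similarity per document under this key only
        # (assumed max-sim aggregation, not taken from the abstract).
        best = {}
        for doc_id, d_vec in index.get(key, []):
            sim = dot(q_vec, d_vec)
            if doc_id not in best or sim > best[doc_id]:
                best[doc_id] = sim
        for doc_id, sim in best.items():
            scores[doc_id] += sim
    return dict(scores)


# Toy data: keys are assigned by hand purely for illustration.
docs = {
    "d1": [("bank", [1.0, 0.0]), ("river", [0.0, 1.0])],
    "d2": [("bank", [0.5, 0.5]), ("loan", [1.0, 0.0])],
}
index = build_index(docs)
query = [("bank", [1.0, 0.0]), ("loan", [1.0, 0.0])]
scores = search(index, query)
print(scores)
```

In this toy setup the query never touches the `river` vector in `d1`, because no query token shares that key. Skipping such pairs is the source of the computational saving described above: a full all-to-all token comparison would score every query token against every document token.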