An embedding index that enables fast approximate nearest neighbor (ANN) search serves as an indispensable component of state-of-the-art deep retrieval systems. Traditional approaches, which often separate embedding learning from index building, incur additional indexing time and degraded retrieval accuracy. In this paper, we propose a novel method called Poeem, which stands for product quantization based embedding index jointly trained with deep retrieval model, to unify the two separate steps within end-to-end training by using several techniques, including the gradient straight-through estimator, a warm start strategy, optimal space decomposition, and Givens rotation. Extensive experimental results show that the proposed method not only improves retrieval accuracy significantly but also reduces the indexing time to almost none. We have open-sourced our approach for the sake of comparison and reproducibility.
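
To make two of the named building blocks concrete, here is a minimal NumPy sketch of generic product quantization and of a single Givens rotation. It is not the Poeem implementation: the function names (`pq_quantize`, `givens_rotation`), the array shapes, and the random codebooks are assumptions chosen for illustration, and the straight-through estimator is described only in a comment because NumPy has no automatic differentiation.

```python
import numpy as np


def pq_quantize(x, codebooks):
    """Generic product quantization (illustrative, not the paper's code).

    x:         (n, d) embeddings
    codebooks: (D, K, d // D) array, K centroids for each of D subspaces
    Returns integer codes of shape (n, D) and the reconstruction (n, d).
    """
    n, d = x.shape
    D, K, sub = codebooks.shape
    assert d == D * sub, "embedding dim must equal D * subspace dim"
    x_sub = x.reshape(n, D, sub)
    # Squared distance from every sub-vector to every centroid: (n, D, K).
    dists = ((x_sub[:, :, None, :] - codebooks[None, :, :, :]) ** 2).sum(-1)
    codes = dists.argmin(-1)
    # Look up the chosen centroid in each subspace: (n, D, sub).
    recon = codebooks[np.arange(D)[None, :], codes]
    # In an autodiff framework, a straight-through estimator would return
    # x + stop_gradient(recon - x), so the forward pass uses the quantized
    # value while gradients flow to x as if quantization were the identity.
    return codes, recon.reshape(n, d)


def givens_rotation(d, i, j, theta):
    """d x d rotation by angle theta in the plane of coordinates i and j."""
    R = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    R[i, i] = c
    R[j, j] = c
    R[i, j] = -s
    R[j, i] = s
    return R


# Hypothetical sizes: 8-dim embeddings, 4 subspaces, 16 centroids each.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
codebooks = rng.normal(size=(4, 16, 2))
R = givens_rotation(8, 1, 5, 0.3)
codes, recon = pq_quantize(x @ R, codebooks)
```

Each embedding is stored as `D` small integers instead of `d` floats, which is what makes the index compact. Because a Givens rotation is orthogonal, a product of such rotations can re-orient the space before it is split into subspaces without changing distances between embeddings.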