Efficient k-nearest neighbor search is a fundamental task, foundational to many problems in NLP. When similarity is measured by the dot-product between dual-encoder vectors or by $\ell_2$-distance, there already exist many scalable and efficient search methods. But not so when similarity is measured by more accurate and expensive black-box neural similarity models, such as cross-encoders, which jointly encode the query and candidate neighbor. The cross-encoders' high computational cost typically limits their use to reranking candidates retrieved by a cheaper model, such as a dual-encoder or TF-IDF. However, the accuracy of such a two-stage approach is upper-bounded by the recall of the initial candidate set, and the approach potentially requires additional training to align the auxiliary retrieval model with the cross-encoder model. In this paper, we present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder. Retrieval is made efficient with CUR decomposition, a matrix decomposition approach that approximates all pairwise cross-encoder distances from a small subset of rows and columns of the distance matrix. Indexing items using our approach is computationally cheaper than training an auxiliary dual-encoder model through distillation. Empirically, for $k > 10$, our approach provides test-time recall-vs-computational-cost trade-offs superior to the current widely used methods that re-rank items retrieved using a dual-encoder or TF-IDF.
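To make the CUR idea concrete, here is a minimal NumPy sketch of CUR-style score approximation. It is an illustration of the generic CUR identity $M \approx C\,W^{+}R$ (approximating a score matrix from a subset of its rows and columns), not the paper's implementation: the helper names (`score_fn`, `build_index`, `approx_scores`, `top_k`) and the choice of anchor queries/items are assumptions made for the example.

```python
import numpy as np

def build_index(score_fn, anchor_queries, items):
    """Offline: score a few anchor queries against every item.

    Returns R, the subset of *rows* of the full score matrix M.
    `score_fn(q, x)` stands in for an expensive black-box cross-encoder.
    """
    return np.array([[score_fn(q, x) for x in items] for q in anchor_queries])

def approx_scores(score_fn, query, items, anchor_item_idx, R):
    """Online: score the test query only against the anchor items (its slice
    of the *columns* C), then extrapolate to all items via the CUR identity
    M[q, :] ~= C[q, anchors] @ pinv(R[:, anchors]) @ R."""
    c = np.array([score_fn(query, items[j]) for j in anchor_item_idx])
    U_pinv = np.linalg.pinv(R[:, anchor_item_idx])  # pseudoinverse of the intersection block W
    return c @ U_pinv @ R                           # approximate scores for every item

def top_k(score_fn, query, items, anchor_item_idx, R, k=10):
    """Retrieve k candidates by approximate score, then re-score them exactly."""
    approx = approx_scores(score_fn, query, items, anchor_item_idx, R)
    cand = np.argsort(-approx)[:k]
    exact = {j: score_fn(query, items[j]) for j in cand}
    return sorted(exact, key=exact.get, reverse=True)
```

Under these assumptions, each test query costs only $|\text{anchor items}| + k$ exact cross-encoder calls rather than one call per corpus item; the remaining scores are recovered from the precomputed rows R by cheap matrix algebra.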