Most text retrievers generate a \emph{single} query vector to retrieve relevant documents. Yet the conditional distribution of documents relevant to a query may be multimodal, e.g., representing different interpretations of the query. We first quantify the limitations of existing retrievers: all retrievers we evaluate struggle more as the distance between target document embeddings grows. To address this limitation, we develop a new retriever architecture, the \emph{A}utoregressive \emph{M}ulti-\emph{E}mbedding \emph{R}etriever (AMER). Our model autoregressively generates multiple query vectors, and all predicted query vectors are used to retrieve documents from the corpus. On synthetic vectorized data, the proposed method captures multiple target distributions perfectly, achieving 4x the performance of a single-embedding model. We also fine-tune our model on real-world multi-answer retrieval datasets and evaluate it in-domain. AMER achieves 4\% and 21\% relative gains over single-embedding baselines on the two datasets we evaluate. Furthermore, we consistently observe larger gains on the subsets of each dataset where the embeddings of the target documents are less similar to each other. These results demonstrate the potential of multi-query-vector retrievers and open up a new direction for future work.