While dense retrieval has been shown to be effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance labels are available. In this paper, we recognize the difficulty of zero-shot learning and of encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings~(HyDE). Given a query, HyDE first zero-shot prompts an instruction-following language model~(e.g. InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is not real and may contain false details. Then, an unsupervised contrastively learned encoder~(e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, from which similar real documents are retrieved based on vector similarity. This second step grounds the generated document in the actual corpus, with the encoder's dense bottleneck filtering out the incorrect details. Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers across various tasks~(e.g. web search, QA, fact verification) and languages~(e.g. sw, ko, ja).
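To make the two steps concrete, the following is a minimal Python sketch of the pipeline, not the authors' implementation: \texttt{instruct\_lm} is assumed to be any text-in/text-out generator, \texttt{encoder} any function mapping text to a vector, and the single-hypothetical-document simplification is illustrative (one can also sample several hypothetical documents and average their embeddings together with the query's).

\begin{verbatim}
import numpy as np

def generate_hypothetical_document(query, instruct_lm):
    """Zero-shot prompt the LM to write a passage that would answer the query.

    The output captures relevance patterns but may contain false details.
    """
    prompt = f"Write a passage that answers the question: {query}"
    return instruct_lm(prompt)

def hyde_retrieve(query, instruct_lm, encoder, corpus_embeddings, corpus, k=10):
    """Two-step HyDE retrieval: generate, then ground via dense similarity."""
    hypo_doc = generate_hypothetical_document(query, instruct_lm)
    q_vec = encoder(hypo_doc)          # dense bottleneck; filters false details
    sims = corpus_embeddings @ q_vec   # inner-product similarity to real docs
    top = np.argsort(-sims)[:k]        # indices of the k nearest real documents
    return [corpus[i] for i in top]
\end{verbatim}

Note that the corpus index itself is unchanged: only the query-side representation is replaced by the embedding of the hypothetical document, which is what lets the method run fully zero-shot over an existing dense index.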