More Robust Dense Retrieval with Contrastive Dual Learning

Dense retrieval conducts text retrieval in the embedding space and has shown many advantages over sparse retrieval. Existing dense retrievers optimize the representations of queries and documents with contrastive training and map them into the embedding space. The embedding space is optimized by aligning matched query-document pairs and pushing negative documents away from the query. However, in such a training paradigm, the queries are optimized only to align with the documents and are coarsely positioned, leading to an anisotropic query embedding space. In this paper, we analyze the embedding space distributions and propose an effective training paradigm, Contrastive Dual Learning for Approximate Nearest Neighbor (DANCE), to learn fine-grained query representations for dense retrieval. DANCE incorporates an additional dual training objective of query retrieval, inspired by the classic information retrieval training axiom, query likelihood. With contrastive learning, the dual training objective of DANCE learns more tailored representations for queries and documents to keep the embedding space smooth and uniform, which improves the ranking performance of DANCE on the MS MARCO document retrieval task. Unlike ANCE, which is optimized only with the document retrieval task, DANCE concentrates the query embeddings closer to the document representations while making the document distribution more discriminative. Such a concentrated query embedding distribution assigns more uniform negative sampling probabilities to queries and helps to sufficiently optimize query representations in the query retrieval task. Our code is released at https://github.com/thunlp/DANCE.
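
The pairing of a document retrieval objective with a dual query retrieval objective can be illustrated with a minimal sketch. This is not the released DANCE implementation. It assumes dot-product similarity, in-batch negatives (rather than negatives sampled from an approximate nearest neighbor index), and a hypothetical `dual_weight` parameter for combining the two terms; the function names are illustrative only.

```python
import numpy as np


def log_softmax(scores, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def dual_contrastive_loss(q_emb, d_emb, dual_weight=1.0):
    """Sketch of a document retrieval loss plus a dual query retrieval loss.

    q_emb[i] and d_emb[i] are assumed to be a matched query-document pair;
    every other row in the batch serves as a negative (an assumption made
    for this sketch, not the sampling strategy described in the abstract).
    """
    scores = q_emb @ d_emb.T  # scores[i, j] = similarity(query i, document j)
    idx = np.arange(scores.shape[0])

    # Document retrieval: each query (row) must pick its matched document
    # out of all documents in the batch.
    doc_loss = -log_softmax(scores, axis=1)[idx, idx].mean()

    # Dual task, query retrieval: each document (column) must pick its
    # matched query out of all queries in the batch.
    query_loss = -log_softmax(scores, axis=0)[idx, idx].mean()

    total = doc_loss + dual_weight * query_loss
    return total, doc_loss, query_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    queries = rng.normal(size=(4, 8))
    documents = queries + 0.1 * rng.normal(size=(4, 8))  # noisy matched pairs
    total, doc_loss, query_loss = dual_contrastive_loss(queries, documents)
    print(f"total={total:.4f} doc={doc_loss:.4f} query={query_loss:.4f}")
```

In this sketch the second term is the only place where queries compete against each other, which is one way to read the abstract's claim that the dual task refines query representations rather than only aligning them to documents.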
