Recent research demonstrates the effectiveness of using fine-tuned language models~(LM) for dense retrieval. However, dense retrievers are hard to train, typically requiring heavily engineered fine-tuning pipelines to realize their full potential. In this paper, we identify and address two underlying problems of dense retrievers: i)~fragility to training data noise and ii)~reliance on large batches to robustly learn the embedding space. We use the recently proposed Condenser pre-training architecture, which learns to condense information into the dense vector through LM pre-training. On top of it, we propose coCondenser, which adds an unsupervised corpus-level contrastive loss to warm up the passage embedding space. Retrieval experiments on the MS-MARCO, Natural Questions, and TriviaQA datasets show that coCondenser removes the need for heavy data engineering such as augmentation, synthesis, or filtering, as well as the need for large-batch training. It shows comparable performance to RocketQA, a state-of-the-art, heavily engineered system, using simple small-batch fine-tuning.
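For concreteness, the corpus-level contrastive warm-up admits a standard in-batch noise-contrastive form; the display below is a minimal sketch in our own notation (the abstract itself fixes no symbols). Assuming two spans are sampled from each document in a batch, the condensed vector $h_i$ of a span is pulled toward the vector $h_i^{+}$ of the other span from the same document and pushed away from all remaining spans in the batch:
\[
\mathcal{L}^{\mathrm{co}}_{i} \;=\; -\log \frac{\exp\!\left(\langle h_i, h_i^{+} \rangle\right)}{\sum_{j \neq i} \exp\!\left(\langle h_i, h_j \rangle\right)},
\]
where $\langle\cdot,\cdot\rangle$ denotes the dot product and the sum in the denominator runs over all other span vectors in the batch.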