The success of contextual word representations and advances in neural information retrieval have made dense vector-based retrieval a standard approach for passage and document ranking. While effective and efficient, dual encoders are brittle to variations in query distribution and to noisy queries. Data augmentation can make models more robust, but it adds overhead to training-set generation and requires retraining and index regeneration. We present Contrastive Alignment POst Training (CAPOT), a highly efficient finetuning method that improves model robustness without requiring index regeneration or training-set alteration. CAPOT enables robust retrieval by freezing the document encoder while the query encoder learns to align noisy queries with their unaltered roots. We evaluate CAPOT on noisy variants of MSMARCO, Natural Questions, and TriviaQA passage retrieval, finding that CAPOT has a similar impact to data augmentation with none of its overhead.
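The core of the method is a contrastive alignment objective on the query side. As a rough illustration, the sketch below shows one plausible instantiation: an InfoNCE-style loss that pulls each noisy query embedding toward the embedding of its clean root and pushes it away from the other clean queries in the batch. The use of a frozen copy of the original query encoder to produce the clean targets is an assumption, and all names and the toy linear encoders are hypothetical, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def capot_alignment_loss(query_encoder, frozen_encoder, clean_batch, noisy_batch, tau=0.05):
    """InfoNCE-style alignment sketch: pull each noisy query toward its clean
    root, push it away from the other clean queries in the batch (assumed
    in-batch negatives; not necessarily the paper's exact formulation)."""
    with torch.no_grad():
        # Clean-query embeddings act as fixed anchors (assumed frozen copy).
        targets = F.normalize(frozen_encoder(clean_batch), dim=-1)
    # Only the query encoder receives gradients.
    noisy = F.normalize(query_encoder(noisy_batch), dim=-1)
    logits = noisy @ targets.T / tau                       # [B, B] similarities
    labels = torch.arange(logits.size(0), device=logits.device)  # diagonal = clean root
    return F.cross_entropy(logits, labels)

# Toy usage with stand-in linear "encoders" over random features.
enc = torch.nn.Linear(32, 16)
frozen = torch.nn.Linear(32, 16)
clean, noisy = torch.randn(8, 32), torch.randn(8, 32)
loss = capot_alignment_loss(enc, frozen, clean, noisy)
loss.backward()
```

Because only the query encoder is updated under this objective, the document embeddings, and hence the index, remain valid, which is what lets CAPOT avoid index regeneration.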