Curriculum Sampling for Dense Retrieval with Document Expansion

The dual-encoder has become the de facto architecture for dense retrieval. Typically, it computes the latent representations of the query and the document independently, and therefore fails to fully capture the interactions between them. To alleviate this, recent work aims to obtain query-informed document representations. During training, it expands the document with a real query, but at inference it replaces the real query with a generated pseudo query. This discrepancy between training and inference causes the dense retrieval model to attend mostly to the query information and to neglect the document when computing the document representation. As a result, it even performs worse than the vanilla dense retrieval model, because its performance depends heavily on the relevance between the generated queries and the real query. In this paper, we propose a curriculum sampling strategy that also uses pseudo queries during training and gradually increases the relevance of the generated query to the real query. In this way, the retrieval model learns to extend its attention from the document alone to both the document and the query, and thus produces high-quality query-informed document representations. Experimental results on several passage retrieval datasets show that our approach outperforms previous dense retrieval methods.
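The idea of gradually increasing the relevance of the sampled pseudo query can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `curriculum_pick`, the sliding-window schedule, the window size of one quarter of the candidates, and the assumption that a relevance score per pseudo query is already available are all hypothetical choices made for illustration.

```python
import random


def curriculum_pick(pseudo_queries, relevance, step, total_steps, rng=random):
    """Pick one pseudo query to expand a document with at a given training step.

    pseudo_queries: generated queries for one document (hypothetical input).
    relevance: one score per pseudo query, measuring relevance to the real
        query; how the score is computed is an assumption left out of this sketch.
    step, total_steps: current training step (0-based) and total number of steps.
    rng: the `random` module or a `random.Random` instance.
    """
    if not pseudo_queries or len(pseudo_queries) != len(relevance):
        raise ValueError("need one relevance score per pseudo query")

    # Order candidates from least to most relevant to the real query.
    ranked = [q for _, q in sorted(zip(relevance, pseudo_queries), key=lambda p: p[0])]

    # Training progress in [0, 1].
    progress = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)

    # Assumed schedule: a fixed-size window slides from the low-relevance end
    # to the high-relevance end of the ranking as training progresses.
    n = len(ranked)
    window = max(1, n // 4)
    start = int(progress * (n - window))
    return rng.choice(ranked[start:start + window])
```

Early in training, a call such as `curriculum_pick(queries, scores, 0, 1000)` returns one of the least relevant pseudo queries, so the model has to rely on the document itself; near the final step it returns one of the most relevant ones, which is closer to the situation the abstract describes for query-informed representations.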
