Recently, Dense Retrieval (DR) has become a promising solution to document retrieval, where document representations are used to perform effective and efficient semantic search. However, DR remains challenging on long documents, due to the quadratic complexity of its Transformer-based encoder and the finite capacity of a low-dimensional embedding. Current DR models handle long documents with suboptimal strategies such as truncation or splitting-and-pooling, which leads to poor utilization of whole-document information. To tackle this problem, we propose Segment representation learning for long documents Dense Retrieval (SeDR). In SeDR, a Segment-Interaction Transformer encodes long documents into document-aware and segment-sensitive representations, while retaining the complexity of splitting-and-pooling and outperforming other segment-interaction patterns on DR. Because the GPU memory requirements of long-document encoding leave too few negatives for DR training, we further propose Late-Cache Negative, which provides additional cached negatives for optimizing representation learning. Experiments on the MS MARCO and TREC-DL datasets show that SeDR achieves superior performance among DR models and confirm its effectiveness on long document retrieval.
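For readers unfamiliar with the splitting-and-pooling baseline that the abstract contrasts SeDR with, the following is a minimal sketch, not SeDR itself. It assumes fixed-length segments and max pooling over per-segment dot-product scores, which is one common choice; the function names and the toy vectors are hypothetical, and the segment embeddings stand in for the output of a real encoder. Splitting matters for cost because self-attention over n tokens scales as n^2, whereas n/m independent segments of length m scale as roughly n*m.

```python
from typing import List, Sequence


def split_into_segments(tokens: Sequence[str], seg_len: int) -> List[List[str]]:
    """Split a token sequence into consecutive segments of at most seg_len tokens."""
    if seg_len <= 0:
        raise ValueError("seg_len must be positive")
    return [list(tokens[i:i + seg_len]) for i in range(0, len(tokens), seg_len)]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def max_pool_score(query_vec: Sequence[float],
                   segment_vecs: Sequence[Sequence[float]]) -> float:
    """Score a document as the best-matching segment (max pooling).

    Assumption: each segment was encoded independently, so no segment
    embedding carries information about the rest of the document.
    """
    if not segment_vecs:
        raise ValueError("document has no segments")
    return max(dot(query_vec, seg) for seg in segment_vecs)


if __name__ == "__main__":
    # Toy document of five tokens, split into segments of two tokens.
    segments = split_into_segments(["a", "b", "c", "d", "e"], seg_len=2)
    print(segments)

    # Hypothetical 2-d embeddings standing in for real encoder outputs.
    query = [1.0, 0.0]
    segment_embeddings = [[0.2, 0.9], [0.7, 0.1], [0.4, 0.4]]
    print(max_pool_score(query, segment_embeddings))
```

Because each segment is encoded in isolation here, the pooled score depends on a single segment, which illustrates the poor use of whole-document information that the abstract attributes to this strategy.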