Dense retrieval aims to map queries and passages into a low-dimensional vector space for efficient similarity computation, and has shown promising effectiveness in various large-scale retrieval tasks. Since most existing methods adopt pre-trained Transformers (e.g., BERT) for parameter initialization, some work focuses on proposing new pre-training tasks that compress the useful semantic information from passages into dense vectors, achieving remarkable performance. However, it remains challenging to capture the rich semantic information of passages, and the relations among them, in dense vectors via a single pre-training task. In this work, we propose a multi-task pre-trained model, MASTER, that unifies and integrates multiple pre-training tasks with different learning objectives under the bottlenecked masked autoencoder architecture. Concretely, MASTER uses a multi-decoder architecture to integrate three types of pre-training tasks: corrupted passage recovery, related passage recovery, and recovery of the outputs of pre-trained language models (PLMs). By incorporating a shared deep encoder, we construct a representation bottleneck in our architecture, compressing the abundant semantic information across tasks into dense vectors. The first two types of tasks concentrate on capturing the semantic information of passages and the relationships among them within the pre-training corpus. The third captures knowledge beyond the corpus from external PLMs (e.g., GPT-2). Extensive experiments on several large-scale passage retrieval datasets show that our approach outperforms the previous state-of-the-art dense retrieval methods. Our code and data are publicly available at https://github.com/microsoft/SimXNS
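To make the two central ideas concrete, the following is a minimal toy sketch, not the MASTER implementation. It illustrates (a) ranking passages by the similarity of dense vectors and (b) a shared encoder whose dense vector is the only input to several task-specific decoders, with the per-task losses summed. Everything in it is an assumption made for illustration: the names `rank_passages` and `ToyMultiDecoderModel`, the use of dot-product similarity, the linear maps standing in for Transformer encoders and decoders, the mean-squared-error loss standing in for the actual recovery objectives, and the equal weighting of the three tasks.

```python
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted by descending dot-product similarity."""
    scores = [dot(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)


def mse(pred, target):
    """Mean squared error, a stand-in for the real recovery losses."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


class Linear:
    """Toy linear map, standing in for a Transformer encoder or decoder."""

    def __init__(self, in_dim, out_dim, rng):
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(out_dim)]

    def __call__(self, x):
        return [dot(row, x) for row in self.w]


class ToyMultiDecoderModel:
    """Shared encoder plus one decoder per pre-training task (hypothetical)."""

    # Task names mirror the three task types named in the abstract.
    TASKS = ("corrupted_passage", "related_passage", "plm_output")

    def __init__(self, in_dim, dense_dim, seed=0):
        rng = random.Random(seed)
        # One encoder shared by all tasks; its output is the bottleneck.
        self.encoder = Linear(in_dim, dense_dim, rng)
        # One separate decoder per task, each fed only the dense vector.
        self.decoders = {t: Linear(dense_dim, in_dim, rng) for t in self.TASKS}

    def encode(self, x):
        return self.encoder(x)

    def loss(self, x, targets):
        """Sum the per-task losses (equal weighting is an assumption)."""
        z = self.encode(x)
        per_task = {t: mse(self.decoders[t](z), targets[t]) for t in self.TASKS}
        return sum(per_task.values()), per_task


if __name__ == "__main__":
    # Retrieval: rank three toy passage vectors against one query vector.
    print(rank_passages([1.0, 0.0], [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]]))

    # Pre-training: every task must be solved from the same dense vector.
    model = ToyMultiDecoderModel(in_dim=4, dense_dim=2)
    x = [1.0, 0.0, 0.5, -1.0]
    total, per_task = model.loss(x, {t: x for t in model.TASKS})
    print(total, per_task)
```

Because every decoder in this sketch sees only the output of `encode`, any information a task needs has to pass through that single dense vector, which is the sense in which the shared encoder forms a bottleneck. The same vector is what `rank_passages` would consume at retrieval time.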