A growing number of techniques have emerged to improve the performance of passage retrieval. As an effective representation bottleneck pre-training technique, the contextual masked auto-encoder uses a contextual embedding to assist in the reconstruction of passages. However, it relies on a single auto-encoding pre-training task for dense representation pre-training. This study brings multi-view modeling to the contextual masked auto-encoder. First, multi-view representation uses both dense and sparse vectors, aiming to capture sentence semantics from different aspects. Second, the multi-view decoding paradigm uses both an auto-encoding decoder and an auto-regressive decoder during representation bottleneck pre-training, aiming to provide both reconstructive and generative signals for better contextual representation pre-training. We refer to this multi-view pre-training method as CoT-MAE v2. Through extensive experiments, we show that CoT-MAE v2 is effective and robust on large-scale passage retrieval benchmarks and out-of-domain zero-shot benchmarks.
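The following is a minimal PyTorch sketch of the multi-view bottleneck pre-training described above, not the paper's actual implementation: the module sizes, the SPLADE-style log-saturated pooling for the sparse view, and all names (`MultiViewMAE`, `lm_head`, and so on) are illustrative assumptions. It shows how one encoder output can feed a dense view, a sparse view, a reconstructive (auto-encoding) decoder, and a generative (auto-regressive) decoder, so that all signals flow back through the shared contextual embedding.

```python
import torch
import torch.nn as nn

class MultiViewMAE(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, n_heads=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=12)
        # Shallow decoders form the bottleneck: the single contextual
        # embedding must carry most of the passage semantics.
        self.ae_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=1)
        self.ar_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=1)
        self.lm_head = nn.Linear(d_model, vocab_size)  # shared LM head

    def forward(self, ctx_ids, tgt_ids):
        # Dense view: the [CLS]-position hidden state of the context passage.
        hidden = self.encoder(self.embed(ctx_ids))
        dense = hidden[:, :1, :]                            # (B, 1, d)
        # Sparse view: max-pooled, log-saturated vocabulary activations
        # (a SPLADE-style assumption, used for lexical-level matching).
        sparse = torch.log1p(torch.relu(self.lm_head(hidden))).amax(dim=1)
        # Reconstructive signal: prepend the dense embedding to the masked
        # target passage and let a shallow auto-encoding decoder recover
        # the masked tokens.
        ae_in = torch.cat([dense, self.embed(tgt_ids)], dim=1)
        ae_logits = self.lm_head(self.ae_decoder(ae_in)[:, 1:, :])
        # Generative signal: a shallow causal decoder generates the target
        # passage left-to-right, cross-attending to the dense embedding.
        T = tgt_ids.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        ar_logits = self.lm_head(
            self.ar_decoder(self.embed(tgt_ids), dense, tgt_mask=causal))
        return dense.squeeze(1), sparse, ae_logits, ar_logits

# Toy usage: each set of logits would feed a cross-entropy term, and the
# per-view losses are summed so all gradients pass through the bottleneck.
model = MultiViewMAE()
ctx = torch.randint(0, 30522, (2, 32))   # context passage token ids
tgt = torch.randint(0, 30522, (2, 32))   # target passage token ids
dense, sparse, ae_logits, ar_logits = model(ctx, tgt)
print(dense.shape, sparse.shape, ae_logits.shape, ar_logits.shape)
```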