Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval

Current dense retrievers (DRs) struggle to process misspelled queries, which account for a significant portion of query traffic in commercial search engines. The main issue is that the pre-trained language model-based encoders used by DRs are typically trained and fine-tuned on clean, well-curated text. Misspelled queries rarely appear in this data, so the misspelled queries observed at inference time are out-of-distribution with respect to it. Previous efforts to address this issue have focused on fine-tuning strategies, but their effectiveness on misspelled queries remains lower than that of pipelines that employ separate state-of-the-art spell-checking components. To address this challenge, we propose ToRoDer (TypOs-aware bottlenecked pre-training for RObust DEnse Retrieval), a novel pre-training strategy for DRs that increases their robustness to misspelled queries while preserving their effectiveness in downstream retrieval tasks. ToRoDer uses an encoder-decoder architecture in which the encoder takes misspelled text with masked tokens as input and outputs bottlenecked information to the decoder. The decoder then takes as input the bottlenecked embeddings, along with token embeddings of the original text with the misspelled tokens masked out. The pre-training task is to recover the masked tokens for both the encoder and decoder. Our extensive experimental results and detailed ablation studies show that DRs pre-trained with ToRoDer exhibit significantly higher effectiveness on misspelled queries, substantially closing the gap with pipelines that use a separate, complex spell-checker component, while retaining their effectiveness on correctly spelled queries.
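To make the input construction described above more concrete, the following is a minimal sketch of how the encoder and decoder inputs might be assembled for one training example. It is an illustration only, not the authors' implementation: the whitespace-level tokens, the four typo types, the probabilities `typo_prob` and `mask_prob`, and the names `inject_typo` and `build_pretraining_inputs` are all assumptions introduced here, and the models themselves (encoder, bottleneck, decoder) are left out.

```python
import random

MASK = "[MASK]"


def inject_typo(word, rng):
    """Return `word` with one random character-level edit.

    The four edit types are an assumption made for this sketch.
    The edit can occasionally be a no-op (e.g. swapping two equal letters).
    """
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["swap", "delete", "insert", "substitute"])
    letter = rng.choice("abcdefghijklmnopqrstuvwxyz")
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + letter + word[i:]
    return word[:i] + letter + word[i + 1:]


def build_pretraining_inputs(tokens, typo_prob=0.3, mask_prob=0.3, seed=0):
    """Build hypothetical encoder/decoder inputs and labels for one text.

    Encoder input: the misspelled text with some tokens masked.
    Decoder input: the original text with the misspelled tokens masked out.
    A label of None means "no prediction required at this position".
    """
    rng = random.Random(seed)
    enc_in, enc_labels, dec_in, dec_labels = [], [], [], []
    for tok in tokens:
        noisy = inject_typo(tok, rng) if rng.random() < typo_prob else tok
        misspelled = noisy != tok

        # Decoder side: clean tokens, except that misspelled positions are masked.
        dec_in.append(MASK if misspelled else tok)
        dec_labels.append(tok if misspelled else None)

        # Encoder side: noisy tokens, with a random subset masked (assumed scheme).
        masked = rng.random() < mask_prob
        enc_in.append(MASK if masked else noisy)
        enc_labels.append(tok if masked else None)

    return {
        "encoder_input": enc_in,
        "encoder_labels": enc_labels,
        "decoder_input": dec_in,
        "decoder_labels": dec_labels,
    }


if __name__ == "__main__":
    text = "dense retrievers struggle with misspelled queries".split()
    example = build_pretraining_inputs(text, typo_prob=0.5, seed=1)
    for name, value in example.items():
        print(name, value)
```

In this sketch, both label lists always hold the original, correctly spelled token, so recovering a masked token on either side amounts to predicting the clean text. In a full setup the decoder would additionally receive the encoder's bottlenecked embedding, which this sketch does not model.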
