Pre-trained Language Model for Web-scale Retrieval in Baidu Search

Retrieval is a crucial stage in web search that identifies a small set of query-relevant candidates from a billion-scale corpus. Discovering more semantically related candidates in the retrieval stage is a promising way to expose more high-quality results to end users. However, building and deploying effective retrieval models for semantic matching in a real search engine remains a non-trivial challenge. In this paper, we describe the retrieval system that we developed and deployed in Baidu Search. The system exploits a recent state-of-the-art Chinese pretrained language model, namely Enhanced Representation through kNowledge IntEgration (ERNIE), which equips the system with expressive semantic matching. In particular, we developed an ERNIE-based retrieval model, which is equipped with 1) expressive Transformer-based semantic encoders, and 2) a comprehensive multi-stage training paradigm. More importantly, we present a practical system workflow for deploying the model in web-scale retrieval. The system has been fully deployed in production, and rigorous offline and online experiments were conducted on it. The results show that the system performs high-quality candidate retrieval, especially for tail queries with uncommon demands. Overall, the new retrieval system, facilitated by a pretrained language model (i.e., ERNIE), largely improves the usability and applicability of our search engine.
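To make the idea of semantic candidate retrieval concrete, the following is a minimal sketch of embedding-based top-k retrieval. It assumes a dual-encoder setup in which queries and documents are mapped to vectors and compared by cosine similarity; the abstract does not spell out this design, so treat it as an illustration rather than the deployed method. The names `retrieve_top_k`, `doc_vecs`, and `query_vec` are hypothetical, and the vectors are hand-made toy values, not ERNIE outputs.

```python
import heapq
import math
from typing import Dict, List, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(
    query_vec: Vector, doc_vecs: Dict[str, Vector], k: int
) -> List[Tuple[str, float]]:
    """Return the k documents with the highest cosine similarity to the query.

    This is a brute-force scan over all documents, which is only
    practical for a toy corpus.
    """
    q = normalize(query_vec)
    scored = ((doc_id, dot(q, normalize(vec))) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings standing in for encoder outputs (assumption: 3 dimensions).
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.6, 0.8, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query_vec = [0.9, 0.1, 0.0]

top = retrieve_top_k(query_vec, doc_vecs, k=2)
for doc_id, score in top:
    print(doc_id, round(score, 3))
```

A brute-force scan like this would not scale to a billion-scale corpus; a system at that size would presumably rely on an approximate nearest-neighbor index, which this sketch leaves out.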
