Delving into E-Commerce Product Retrieval with Vision-Language Pre-training

E-commerce search engines comprise a retrieval phase and a ranking phase; the retrieval phase returns a candidate product set for a given user query. Vision-language (V+L) pre-training, which combines textual information with visual clues, has recently become popular for retrieval tasks. In this paper, we propose a novel V+L pre-training method to solve the retrieval problem in Taobao Search. We design a visual pre-training task based on contrastive learning, which outperforms common regression-based visual pre-training tasks. We also adopt two negative sampling schemes tailored to the large-scale retrieval task, and we describe how the proposed method is deployed online in a real-world setting. Extensive offline and online experiments demonstrate the superior performance of our method on the retrieval task. The proposed method is employed as one retrieval channel of Taobao Search and serves hundreds of millions of users in real time.
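
The abstract names contrastive learning and negative sampling but does not spell out the loss or the two sampling schemes. As a rough illustration of the general idea only, the sketch below computes a generic InfoNCE-style contrastive loss with in-batch negatives, where each query's matching product is the positive and every other product in the batch serves as a negative. This is an assumption chosen for illustration, not the paper's actual objective; the function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(query_embs, item_embs, temperature=0.07):
    """Generic InfoNCE loss with in-batch negatives (illustrative assumption).

    query_embs[i] and item_embs[i] form a positive pair; item_embs[j] for
    j != i act as negatives for query i. Returns the mean loss over the batch.
    """
    if not query_embs or len(query_embs) != len(item_embs):
        raise ValueError("need equally sized, non-empty batches")

    queries = [l2_normalize(q) for q in query_embs]
    items = [l2_normalize(p) for p in item_embs]

    total = 0.0
    for i, q in enumerate(queries):
        # Similarity of query i to every item in the batch, scaled by temperature.
        logits = [dot(q, p) / temperature for p in items]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits)
        )
        # Cross-entropy with the matching item (index i) as the target class.
        total += log_sum_exp - logits[i]
    return total / len(queries)


# Toy batch: two queries whose matching items point in the same direction.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_items = [[2.0, 0.0], [0.0, 3.0]]
shuffled_items = [[0.0, 3.0], [2.0, 0.0]]

loss_aligned = info_nce_loss(toy_queries, aligned_items)
loss_shuffled = info_nce_loss(toy_queries, shuffled_items)
```

On the toy batch, the aligned pairing yields a much smaller loss than the shuffled one, which is the behavior a contrastive objective is meant to reward. A production-scale retrieval system would typically compute the same quantity with a tensor library and a far larger pool of negatives.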
