E-commerce search engines comprise a retrieval phase and a ranking phase, where the former returns a set of candidate products for a given user query. Recently, vision-language (V+L) pre-training, which combines textual information with visual cues, has become popular for retrieval tasks. In this paper, we propose a novel V+L pre-training method to solve the retrieval problem in Taobao Search. We design a visual pre-training task based on contrastive learning that outperforms common regression-based visual pre-training tasks. In addition, we adopt two negative sampling schemes tailored to the large-scale retrieval task. We also describe how the proposed method is deployed online in a real-world setting. Extensive offline and online experiments demonstrate the superior performance of our method on the retrieval task. The proposed method is deployed as one retrieval channel of Taobao Search and serves hundreds of millions of users in real time.
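To make the contrastive idea concrete, the following is a minimal sketch of an InfoNCE-style loss with in-batch negatives, a common way to train query and item embeddings for retrieval. It is an illustration under stated assumptions, not the method of this paper: the abstract specifies neither the loss form nor the two negative sampling schemes, and the function names and the `temperature` value are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_in_batch(query_embs, item_embs, temperature=0.07):
    """Average InfoNCE loss over a batch.

    Assumption (not from the paper): item_embs[i] is the positive for
    query_embs[i], and every other item in the batch acts as a negative.
    """
    queries = [l2_normalize(q) for q in query_embs]
    items = [l2_normalize(v) for v in item_embs]
    total = 0.0
    for i, q in enumerate(queries):
        # Temperature-scaled cosine similarity between query i and every item.
        logits = [sum(a * b for a, b in zip(q, v)) / temperature for v in items]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the positive item.
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy batch: each query points roughly in the direction of its own item.
queries = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]
items = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_loss = info_nce_in_batch(queries, items)
# Rotating the items breaks the pairing, so the loss should increase.
shuffled_loss = info_nce_in_batch(queries, items[1:] + items[:1])
```

In this sketch the loss is small when each query is closest to its paired item and grows when the pairing is broken. A large-scale system would typically compute the same quantity on batched tensors and draw negatives from a much larger pool than a single batch.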