Taobao Search consists of two phases: the retrieval phase and the ranking phase. Given a user query, the retrieval phase returns a subset of candidate products for the subsequent ranking phase. Recently, the paradigm of pre-training and fine-tuning has shown its potential for incorporating visual clues into retrieval tasks. In this paper, we focus on the problem of text-to-multimodal retrieval in Taobao Search. We observe that users' attention to titles or images varies across products. Hence, we propose a novel Modal Adaptation module for cross-modal fusion, which helps assign appropriate weights to texts and images across products. Furthermore, in e-commerce search, user queries tend to be brief, which leads to a significant semantic imbalance between user queries and product titles. Therefore, we design a separate text encoder and a Keyword Enhancement mechanism to enrich the query representations and improve text-to-multimodal matching. To this end, we present a novel vision-language (V+L) pre-training method that exploits the multimodal information of (user query, product title, product image). Extensive experiments demonstrate that our retrieval-specific pre-training model (referred to as MAKE) outperforms existing V+L pre-training methods on the text-to-multimodal retrieval task. MAKE has been deployed online and brings major improvements to the retrieval system of Taobao Search.
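The abstract does not specify how the Modal Adaptation module computes its weights. As a rough illustration of the general idea only (per-product weights over a title embedding and an image embedding), the following sketch uses a softmax gate. The function names, the `gate_vec` parameter, and the gating form are all assumptions made for illustration, not the design used in MAKE.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def modal_adaptive_fusion(title_vec, image_vec, gate_vec):
    """Fuse title and image embeddings with product-specific weights.

    gate_vec is a hypothetical learned parameter (an assumption, not
    taken from the paper). It scores each modality, and a softmax turns
    the two scores into weights that sum to 1.
    """
    weights = softmax([dot(gate_vec, title_vec), dot(gate_vec, image_vec)])
    fused = [weights[0] * t + weights[1] * i
             for t, i in zip(title_vec, image_vec)]
    return fused, weights


def cosine(a, b):
    """Cosine similarity, used here as a stand-in retrieval score."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy embeddings: this gate favours the title for this product.
title_vec = [0.9, 0.1, 0.0]
image_vec = [0.1, 0.8, 0.1]
gate_vec = [1.0, 0.0, 0.0]
query_vec = [1.0, 0.0, 0.0]

fused, weights = modal_adaptive_fusion(title_vec, image_vec, gate_vec)
score = cosine(query_vec, fused)
```

In this toy setting the fused product vector is a convex combination of the two modality embeddings, and the query is matched against it with a single similarity score. A real system would learn the gate jointly with the encoders during pre-training.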