Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining

Nowadays, customers' demands for E-commerce are more diversified, which complicates product retrieval. Previous methods are either limited to single-modal input or perform supervised image-level product retrieval, and thus fail to accommodate real-life scenarios where enormous amounts of weakly annotated multi-modal data are present. In this paper, we investigate a more realistic setting: weakly supervised, multi-modal, instance-level product retrieval among fine-grained product categories. To promote the study of this challenging task, we contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval. Notably, Product1M contains over 1 million image-caption pairs and consists of two sample types, i.e., single-product and multi-product samples, which encompass a wide variety of cosmetics brands. In addition to its great diversity, Product1M has several appealing characteristics, including fine-grained categories, complex combinations, and fuzzy correspondence, that closely mimic real-world scenes. Moreover, we propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE), which excels at capturing the potential synergy between multi-modal inputs via a hybrid-stream transformer trained in a self-supervised manner. CAPTURE generates discriminative instance features via masked multi-modal learning as well as cross-modal contrastive pretraining, and it outperforms several SOTA cross-modal baselines. Extensive ablation studies demonstrate the effectiveness and the generalization capacity of our model.
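To make the idea of cross-modal contrastive pretraining concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss over paired image and caption embeddings. It is not the CAPTURE implementation: the abstract does not specify the loss form, so the symmetric formulation, the function names, the temperature value of 0.07, and the toy embeddings are all illustrative assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Assumed InfoNCE-style loss: pair i (image i, caption i) is the positive,
    and every other item in the batch acts as a negative."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine similarity between every image and every caption, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Toy embeddings (hypothetical): caption i is close to image i.
image_embs = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1], [0.3, 0.2, 1.0]]
text_embs = [[0.9, 0.1, 0.1], [0.1, 0.8, 0.0], [0.2, 0.1, 0.9]]

matched = symmetric_contrastive_loss(image_embs, text_embs)
mismatched = symmetric_contrastive_loss(image_embs, text_embs[::-1])
print(matched < mismatched)
```

In this sketch, correctly paired embeddings yield a lower loss than the same embeddings with the caption order reversed, which is the signal a contrastive objective of this kind uses to pull matching image-caption pairs together.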
