This paper aims to establish a generic multi-modal foundation model that can scale to massive downstream applications in E-commerce. Recently, large-scale vision-language pretraining approaches have achieved remarkable advances in the general domain. However, due to the significant differences between natural and product images, directly applying these frameworks, which model image-level representations, to E-commerce will inevitably be sub-optimal. To this end, we propose an instance-centric multi-modal pretraining paradigm called ECLIP in this work. In detail, we craft a decoder architecture that introduces a set of learnable instance queries to explicitly aggregate instance-level semantics. Moreover, to enable the model to focus on the desired product instance without relying on expensive manual annotations, we further propose two specially configured pretext tasks. Pretrained on 100 million E-commerce-related data samples, ECLIP successfully extracts more generic, semantic-rich, and robust representations. Extensive experimental results show that, without further fine-tuning, ECLIP surpasses existing methods by a large margin on a broad range of downstream tasks, demonstrating its strong transferability to real-world E-commerce applications.
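To make the "learnable instance queries" idea concrete, the sketch below shows one common way such a decoder can be realized: a small set of learnable query embeddings cross-attends to image patch features and pools instance-level representations. This is only an illustrative sketch under assumed module names, dimensions, and hyperparameters; it is not the authors' released implementation.

```python
# Minimal illustrative sketch (not the authors' code): learnable instance
# queries that cross-attend to image patch features in a transformer decoder
# to aggregate instance-level semantics. All names and sizes are assumptions.
import torch
import torch.nn as nn


class InstanceQueryDecoder(nn.Module):
    def __init__(self, num_queries: int = 8, dim: int = 768, num_layers: int = 2):
        super().__init__()
        # Learnable instance queries, shared across all images.
        self.instance_queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=8, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim) features from a vision encoder.
        batch = patch_tokens.size(0)
        queries = self.instance_queries.unsqueeze(0).expand(batch, -1, -1)
        # Each query cross-attends to the patch tokens, pooling an
        # instance-level representation per query.
        return self.decoder(queries, patch_tokens)  # (batch, num_queries, dim)


if __name__ == "__main__":
    model = InstanceQueryDecoder()
    feats = torch.randn(2, 196, 768)   # e.g. a 14x14 ViT patch grid
    instance_reprs = model(feats)
    print(instance_reprs.shape)        # torch.Size([2, 8, 768])
```

In this style of design, each query can specialize toward a different product instance in the image, which is consistent with the paper's goal of instance-level rather than image-level representations; the pretext tasks mentioned in the abstract would then supervise which query attends to the desired product without manual box annotations.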