This paper aims to establish a generic multi-modal foundation model that can scale to massive downstream applications in E-commerce. Recently, large-scale vision-language pretraining approaches have achieved remarkable advances in the general domain. However, due to the significant differences between natural and product images, directly applying these frameworks, which model image-level representations, to E-commerce will inevitably be sub-optimal. To this end, we propose an instance-centric multi-modal pretraining paradigm called ECLIP in this work. In detail, we craft a decoder architecture that introduces a set of learnable instance queries to explicitly aggregate instance-level semantics. Moreover, to enable the model to focus on the desired product instance without relying on expensive manual annotations, we further propose two specially configured pretext tasks. Pretrained on 100 million E-commerce-related data samples, ECLIP successfully extracts more generic, semantic-rich, and robust representations. Extensive experimental results show that, without further fine-tuning, ECLIP surpasses existing methods by a large margin on a broad range of downstream tasks, demonstrating its strong transferability to real-world E-commerce applications.
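To make the "learnable instance queries" idea concrete, the sketch below shows one common way such a decoder can be realized: a small set of learnable query embeddings cross-attends to image patch features and pools instance-level representations. This is only an illustrative sketch under assumed module names, dimensions, and hyperparameters; it is not the authors' released implementation.

```python
# Minimal illustrative sketch (not the authors' code): learnable instance
# queries that cross-attend to image patch features in a transformer decoder
# to aggregate instance-level semantics. All names and sizes are assumptions.
import torch
import torch.nn as nn


class InstanceQueryDecoder(nn.Module):
    def __init__(self, num_queries: int = 8, dim: int = 768, num_layers: int = 2):
        super().__init__()
        # Learnable instance queries, shared across all images.
        self.instance_queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=8, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim) features from a vision encoder.
        batch = patch_tokens.size(0)
        queries = self.instance_queries.unsqueeze(0).expand(batch, -1, -1)
        # Each query cross-attends to the patch tokens, pooling an
        # instance-level representation per query.
        return self.decoder(queries, patch_tokens)  # (batch, num_queries, dim)


if __name__ == "__main__":
    model = InstanceQueryDecoder()
    feats = torch.randn(2, 196, 768)   # e.g. a 14x14 ViT patch grid
    instance_reprs = model(feats)
    print(instance_reprs.shape)        # torch.Size([2, 8, 768])
```

In this style of design, each query can specialize toward a different product instance in the image, which is consistent with the paper's goal of instance-level rather than image-level representations; the pretext tasks mentioned in the abstract would then supervise which query attends to the desired product without manual box annotations.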