Multimodal tasks in the fashion domain have significant potential for e-commerce, but involve challenging vision-and-language learning problems, e.g., retrieving a fashion item given a reference image plus text feedback from a user. Prior works on multimodal fashion tasks have either been limited by the data in individual benchmarks, or have leveraged generic vision-and-language pre-training without taking advantage of the characteristics of fashion data. Additionally, these works have mainly been restricted to multimodal understanding tasks. To address these gaps, we make two key contributions. First, we propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs. We show that these triplet-based tasks are an effective addition to standard multimodal pre-training tasks. Second, we propose a flexible decoder-based model architecture capable of performing both fashion retrieval and captioning tasks. Together, our model design and pre-training approach are competitive on a diverse set of fashion tasks, including cross-modal retrieval, image retrieval with text feedback, image captioning, relative image captioning, and multimodal categorization.
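To make the first contribution more concrete, below is a minimal sketch of one way weakly-supervised triplets could be mined from image-text pairs. The mining rule (treating each image's nearest visual neighbor as a pseudo "target") and all function names here are illustrative assumptions, not the paper's actual construction; in the real retrieval-with-feedback task the anchor would be a fused reference-image-plus-text query embedding rather than the raw image embedding used here.

```python
# Illustrative sketch only: weak triplet mining from image-text pairs.
# Assumes precomputed, unit-normalized image embeddings; the pairing
# heuristic and loss are standard choices, not the paper's method.
import numpy as np

def mine_weak_triplets(img_emb: np.ndarray) -> list[tuple[int, int]]:
    """Return (reference_idx, target_idx) pairs: each image's nearest
    other image under cosine similarity (embeddings assumed unit-norm),
    forming pseudo (reference, target) pairs without human annotation."""
    sim = img_emb @ img_emb.T          # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)     # exclude trivial self-matches
    targets = sim.argmax(axis=1)       # nearest neighbor per image
    return [(i, int(j)) for i, j in enumerate(targets)]

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss over cosine similarities:
    pull anchor toward positive, push it away from negative."""
    pos = np.sum(anchor * positive, axis=-1)
    neg = np.sum(anchor * negative, axis=-1)
    return np.maximum(0.0, margin - pos + neg).mean()

# Toy usage: 8 random "image embeddings", normalized to unit length.
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
img /= np.linalg.norm(img, axis=1, keepdims=True)
pairs = mine_weak_triplets(img)

ref = img[[i for i, _ in pairs]]       # pseudo reference images
tgt = img[[j for _, j in pairs]]       # pseudo target images
neg = img[rng.permutation(len(img))]   # random negatives
print(pairs[:3])
print(f"triplet loss: {triplet_margin_loss(ref, tgt, neg):.3f}")
```

A margin-based triplet objective of this shape is a common way to turn mined (reference, target) pairs into a pre-training signal; the actual framework combines such triplet-based tasks with standard multimodal pre-training objectives.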