FashionSAP: Symbols and Attributes Prompt for Fine-grained Fashion Vision-Language Pre-training

Fashion vision-language pre-training models have shown efficacy for a wide range of downstream tasks. However, general vision-language pre-training models pay less attention to fine-grained domain features, even though these features are important for distinguishing domain-specific tasks from general tasks. We propose a method for fine-grained fashion vision-language pre-training based on fashion Symbols and Attributes Prompt (FashionSAP) to model fine-grained multimodal fashion attributes and characteristics. First, we propose fashion symbols, a novel abstract fashion concept layer, to represent different fashion items and to generalize various kinds of fine-grained fashion features, which makes modelling fine-grained attributes more effective. Second, we propose the attributes prompt method, which makes the model learn specific attributes of fashion items explicitly. We design prompt templates according to the format of the fashion data. Comprehensive experiments are conducted on two public fashion benchmarks, FashionGen and FashionIQ, and FashionSAP achieves state-of-the-art performance on four popular fashion tasks. The ablation study also shows that the proposed abstract fashion symbols and the attributes prompt method enable the model to acquire fine-grained semantics in the fashion domain effectively. The clear performance gains from FashionSAP provide a new baseline for future research on fashion tasks.
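To make the attributes prompt idea concrete, here is a minimal sketch of how attribute name/value pairs from fashion metadata might be turned into text prompts. Everything in it is an assumption made for illustration: the template wording, the `[MASK]` token, the function name `build_attribute_prompts`, and the choice of a masked-prediction target are hypothetical and are not taken from the paper, which only states that templates are designed according to the format of the fashion data.

```python
from typing import Dict, List, Tuple

# Hypothetical template wording; the templates used by FashionSAP are not given in the abstract.
TEMPLATE = "the {name} of this item is {value}"
# Assumed mask token, in the style of BERT-like text encoders.
MASK_TOKEN = "[MASK]"


def build_attribute_prompts(attributes: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return one (masked_prompt, target_value) pair per attribute.

    The attribute value is replaced by the mask token, so a model could be
    trained to recover it from the image and the remaining text.
    """
    pairs = []
    for name, value in attributes.items():
        masked_prompt = TEMPLATE.format(name=name, value=MASK_TOKEN)
        pairs.append((masked_prompt, value))
    return pairs


# Illustrative metadata for a single item (made up for this example).
item = {"category": "dress", "color": "red"}
prompts = build_attribute_prompts(item)
for masked_prompt, target in prompts:
    print(masked_prompt, "->", target)
```

Each attribute yields its own prompt, so the model sees every attribute as a separate, explicit prediction target rather than as one word among many in a free-form caption.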
