The rapid expansion of online fashion platforms has created growing demand for intelligent recommender systems that understand both visual and textual cues. This paper proposes a hybrid multimodal deep learning framework for fashion recommendation that jointly addresses two key tasks: outfit compatibility prediction and complementary item retrieval. The model leverages the visual and textual encoders of the CLIP architecture to obtain joint latent representations of fashion items, which are integrated into a unified feature vector per item and processed by a Transformer encoder. For compatibility prediction, an "outfit token" is introduced to model the holistic relationships among items, achieving an AUC of 0.95 on the Polyvore dataset. For complementary item retrieval, a "target item token" representing the desired item description is used to retrieve compatible items, reaching an accuracy of 69.24% under the Fill-in-the-Blank (FITB) metric. The proposed approach demonstrates strong performance on both tasks, highlighting the effectiveness of multimodal learning for fashion recommendation.
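The following is a minimal sketch of the compatibility-prediction branch described above, assuming precomputed CLIP image and text embeddings per item (e.g. 512-dimensional features). The fusion by concatenation plus a linear layer, the layer sizes, and the classification head are illustrative assumptions rather than the authors' exact configuration; the sketch only shows how a learnable "outfit token" prepended to the item sequence can be read out for a holistic compatibility score.

```python
import torch
import torch.nn as nn

class OutfitCompatibilitySketch(nn.Module):
    """Hedged sketch: fuse per-item CLIP features, prepend an outfit token,
    run a Transformer encoder, and score compatibility from the token output.
    Dimensions and fusion strategy are assumptions for illustration."""

    def __init__(self, clip_dim: int = 512, d_model: int = 512,
                 nhead: int = 8, num_layers: int = 4):
        super().__init__()
        # Fuse each item's visual and textual CLIP embeddings into one vector.
        self.fuse = nn.Linear(2 * clip_dim, d_model)
        # Learnable "outfit token" prepended to the sequence of item vectors.
        self.outfit_token = nn.Parameter(torch.randn(1, 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Compatibility head reads the outfit token's contextualized output.
        self.head = nn.Linear(d_model, 1)

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_emb, txt_emb: (batch, num_items, clip_dim)
        items = self.fuse(torch.cat([img_emb, txt_emb], dim=-1))
        token = self.outfit_token.expand(items.size(0), -1, -1)
        seq = torch.cat([token, items], dim=1)       # outfit token at position 0
        out = self.encoder(seq)
        return torch.sigmoid(self.head(out[:, 0]))   # compatibility score in [0, 1]

# Usage example with random stand-in embeddings for a batch of 4-item outfits.
model = OutfitCompatibilitySketch()
score = model(torch.randn(2, 4, 512), torch.randn(2, 4, 512))
print(score.shape)  # torch.Size([2, 1])
```

For the retrieval task, the same backbone could in principle replace the outfit token with a "target item token" built from the desired item's text description and rank candidate items by similarity to its output representation, though the exact retrieval head is not specified here.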