Sequential recommendation systems utilize the sequential interactions of users with items as their main supervision signals for learning users' preferences. However, existing methods usually generate unsatisfactory results due to the sparsity of user behavior data. To address this issue, we propose a novel pre-training framework, named Multimodal Sequence Mixup for Sequential Recommendation (MSM4SR), which leverages both users' sequential behaviors and items' multimodal content (\ie text and images) for effective recommendation. Specifically, MSM4SR tokenizes each item image into multiple textual keywords and uses the pre-trained BERT model to obtain initial textual and visual features of items, eliminating the discrepancy between the text and image modalities. A novel backbone network, \ie the Multimodal Mixup Sequence Encoder (M$^2$SE), is proposed to bridge the gap between item multimodal content and user behavior, using a complementary sequence mixup strategy. In addition, two contrastive learning tasks are developed to assist M$^2$SE in learning generalized multimodal representations of user behavior sequences. Extensive experiments on real-world datasets demonstrate that MSM4SR outperforms state-of-the-art recommendation methods. Moreover, we further verify the effectiveness of MSM4SR on other challenging tasks, including cold-start and cross-domain recommendation.