This paper studies the multi-modal recommendation problem, where item multi-modality information (e.g., images and textual descriptions) is exploited to improve recommendation accuracy. Besides the user-item interaction graph, existing state-of-the-art methods usually use auxiliary graphs (e.g., user-user or item-item relation graphs) to augment the learned representations of users and/or items. These representations are often propagated and aggregated over the auxiliary graphs with graph convolutional networks, which can be prohibitively expensive in computation and memory, especially for large graphs. Moreover, existing multi-modal recommendation methods usually leverage randomly sampled negative examples in the Bayesian Personalized Ranking (BPR) loss to guide the learning of user/item representations, which increases the computational cost on large graphs and may also introduce noisy supervision signals into training. To tackle these issues, we propose a novel self-supervised multi-modal recommendation model, dubbed BM3, which requires neither augmentations from auxiliary graphs nor negative samples. Specifically, BM3 first bootstraps latent contrastive views from the representations of users and items with a simple dropout augmentation. It then jointly optimizes three multi-modal objectives to learn user and item representations by reconstructing the user-item interaction graph and aligning modality features from both inter- and intra-modality perspectives. BM3 dispenses with both contrasting against negative examples and the complex graph augmentation from an additional target network for contrastive view generation. We show that BM3 outperforms prior recommendation models on three datasets with node counts ranging from 20K to 200K, while achieving a 2-9X reduction in training time. Our code is available at https://github.com/enoche/BM3.
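To make the negative-sample-free objective concrete, below is a minimal PyTorch sketch of the dropout-based latent-view bootstrapping and the stop-gradient alignment loss described above. It is not the authors' released implementation; names such as `LatentViewContrast`, `predictor`, `embed_dim`, and `dropout_p` are illustrative assumptions.

```python
# Minimal sketch (not the released BM3 code) of a dropout-bootstrapped,
# negative-sample-free contrastive objective on user/item embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentViewContrast(nn.Module):
    def __init__(self, embed_dim: int = 64, dropout_p: float = 0.5):
        super().__init__()
        # A light predictor maps the online (perturbed) view before alignment.
        self.predictor = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        """h: user or item embeddings of shape (batch, embed_dim)."""
        # Bootstrap a latent contrastive view with a simple dropout perturbation.
        online = self.predictor(self.dropout(h))
        target = h.detach()  # stop-gradient target; no negative examples needed
        online = F.normalize(online, dim=-1)
        target = F.normalize(target, dim=-1)
        # Maximize cosine similarity between the two views (loss in [0, 2]).
        return (1.0 - (online * target).sum(dim=-1)).mean()

# Usage: the same loss form can be applied to ID embeddings and to each
# modality's features (inter-/intra-modality alignment), alongside a
# user-item interaction reconstruction term.
loss_fn = LatentViewContrast(embed_dim=64)
user_emb = torch.randn(32, 64)
loss = loss_fn(user_emb)
```

The key design choice this sketch mirrors is that the contrastive view comes from dropout on the representations themselves, so no auxiliary-graph augmentation, extra target network, or negative sampling is required.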