News representation and user-oriented modeling are both essential for news recommendation. Most existing methods rely on textual information alone, ignoring both visual information and users' dynamic interests. Compared with text-only content, however, multimodal semantics helps capture users' temporal and long-lasting interests. In this work, we propose a vision-language coordinated time-sequence news recommendation model. First, a pretrained multimodal encoder embeds images and texts into a shared feature space. A self-attention network then models the chronological click sequence. Additionally, an attentional GRU network is proposed to adequately model users' time-varying preferences. Finally, the click-history and user representations are combined to compute ranking scores for candidate news. Furthermore, we construct V-MIND, a large-scale multimodal news recommendation dataset. Experimental results show that our model outperforms baselines and achieves state-of-the-art performance on our independently constructed dataset.
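The following is a minimal sketch of the pipeline the abstract describes, assuming PyTorch. All module names, dimensions, and the fusion/pooling choices (MultimodalNewsEncoder, AttentionalGRU, additive attention, inner-product scoring) are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalNewsEncoder(nn.Module):
    """Stand-in for a pretrained vision-language encoder that projects
    image and text features into the same feature space."""
    def __init__(self, img_dim, txt_dim, dim):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)

    def forward(self, img_feat, txt_feat):
        # Fuse the two modalities in the shared space (assumed: sum fusion).
        return self.img_proj(img_feat) + self.txt_proj(txt_feat)


class AttentionalGRU(nn.Module):
    """GRU over the chronological click sequence, followed by attention
    pooling to produce a time-aware user representation."""
    def __init__(self, dim):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.query = nn.Linear(dim, 1)

    def forward(self, seq):                             # seq: (B, T, dim)
        states, _ = self.gru(seq)                       # (B, T, dim)
        weights = F.softmax(self.query(states), dim=1)  # (B, T, 1)
        return (weights * states).sum(dim=1)            # (B, dim)


class RecommendationModel(nn.Module):
    def __init__(self, img_dim=512, txt_dim=768, dim=256, heads=4):
        super().__init__()
        self.news_encoder = MultimodalNewsEncoder(img_dim, txt_dim, dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.agru = AttentionalGRU(dim)

    def forward(self, hist_img, hist_txt, cand_img, cand_txt):
        # Encode the clicked-news history: (B, T, dim).
        hist = self.news_encoder(hist_img, hist_txt)
        # Self-attention over the chronological sequence.
        hist, _ = self.self_attn(hist, hist, hist)
        # Attentional GRU yields the user vector: (B, dim).
        user = self.agru(hist)
        # Encode candidate news: (B, K, dim).
        cand = self.news_encoder(cand_img, cand_txt)
        # Ranking score as user-candidate inner product: (B, K).
        return torch.einsum("bd,bkd->bk", user, cand)
```

Under these assumptions, candidates are ranked per user by sorting the returned scores; any negative-sampling loss (e.g., sampled softmax over clicked vs. non-clicked candidates) can be attached on top.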