Accurate news representation is critical for news recommendation. Most existing news representation methods learn news representations only from news texts while ignoring the visual information in news, such as images. In fact, users may click on news not only out of interest in news titles but also because of the attraction of news images. Thus, images are useful for representing news and predicting user behaviors. In this paper, we propose a multimodal news recommendation method that incorporates both the textual and visual information of news to learn multimodal news representations. We first extract regions of interest (ROIs) from news images via object detection. Then we use a pre-trained visiolinguistic model to encode both news texts and news image ROIs and model their inherent relatedness using co-attentional Transformers. In addition, we propose a crossmodal candidate-aware attention network that selects relevant historical clicked news for accurate user modeling by measuring the crossmodal relatedness between clicked news and candidate news. Experiments validate that incorporating multimodal news information can effectively improve news recommendation.
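The candidate-aware attention described above can be illustrated with a minimal sketch: each clicked news is scored against the candidate news, the scores are normalized into attention weights, and the user representation is the weighted sum of clicked-news vectors. This is an assumption-laden toy with random vectors and plain dot-product scoring, not the paper's actual model; all function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def candidate_aware_user_vector(clicked, candidate):
    """Toy candidate-aware attention (hypothetical simplification).

    clicked: (N, d) multimodal vectors of historical clicked news
    candidate: (d,) multimodal vector of the candidate news
    Returns a (d,) user representation that emphasizes clicked news
    related to the candidate.
    """
    scores = clicked @ candidate      # relatedness of each clicked news
    weights = softmax(scores)         # attention weights, sum to 1
    return weights @ clicked          # weighted sum over clicked news

# Illustrative usage with random "multimodal" vectors.
rng = np.random.default_rng(0)
clicked = rng.normal(size=(5, 8))     # 5 clicked news, dim-8 vectors
candidate = rng.normal(size=8)
user = candidate_aware_user_vector(clicked, candidate)
click_score = float(user @ candidate) # final ranking score for the candidate
```

In the paper the relatedness is measured crossmodally (text vs. image ROIs); here a single fused vector per news stands in for that, purely for illustration.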