Existing explanation models generate only textual explanations for recommendations and still struggle to produce diverse content. In this paper, to further enrich explanations, we propose a new task named personalized showcases, in which we provide both textual and visual information to explain our recommendations. Specifically, we first select a personalized image set that is most relevant to a user's interests regarding a recommended item. Natural language explanations are then generated conditioned on the selected images. For this new task, we collect a large-scale dataset from Google Local (i.e., Maps) and construct a high-quality subset for generating multi-modal explanations. We propose a personalized multi-modal framework that generates diverse and visually-aligned explanations via contrastive learning. Experiments show that our framework benefits from different modalities as inputs, and produces more diverse and expressive explanations than previous methods across a variety of evaluation metrics.
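The abstract does not detail the framework's architecture; the following is a minimal illustrative sketch of the two stages it describes, assuming a CLIP-style encoder pair for scoring image relevance to a user's profile and an InfoNCE-style contrastive term for aligning generated explanations with the selected images. All function names and hyperparameters here are hypothetical, not the paper's actual implementation.

```python
# Illustrative sketch only: hypothetical names, not the paper's actual method.
import torch
import torch.nn.functional as F

def select_images(user_text_emb, item_image_embs, k=3):
    """Stage 1 (personalized image selection): rank an item's image embeddings
    by cosine similarity to the user's profile embedding and keep the top-k
    as the personalized showcase."""
    sims = F.cosine_similarity(user_text_emb.unsqueeze(0), item_image_embs, dim=-1)
    topk = sims.topk(k).indices
    return item_image_embs[topk]

def contrastive_loss(expl_embs, image_embs, temperature=0.07):
    """Stage 2 (visually-aligned generation): an InfoNCE-style objective over a
    batch, pulling each generated explanation toward the image set it was
    conditioned on and pushing it away from other items' images."""
    expl = F.normalize(expl_embs, dim=-1)
    imgs = F.normalize(image_embs, dim=-1)
    logits = expl @ imgs.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```

In such a setup, the contrastive term would be added to the standard generation loss during training, which is one plausible way to encourage the diversity and visual alignment the abstract reports.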