Existing explanation models generate only text for recommendations and still struggle to produce diverse content. In this paper, to further enrich explanations, we propose a new task named personalized showcases, in which we provide both textual and visual information to explain our recommendations. Specifically, we first select a personalized image set that is most relevant to a user's interest in a recommended item. Then, natural language explanations are generated conditioned on the selected images. For this new task, we collect a large-scale dataset from Google Local (i.e., maps) and construct a high-quality subset for generating multi-modal explanations. We propose a personalized multi-modal framework that generates diverse and visually aligned explanations via contrastive learning. Experiments show that our framework benefits from different modalities as inputs and produces more diverse and expressive explanations than previous methods across a variety of evaluation metrics.
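As a rough illustration of the two stages the abstract describes (personalized image selection followed by contrastively aligned explanation generation), the sketch below shows one plausible reading under stated assumptions: pre-computed user-interest and image embeddings, cosine-similarity ranking for image selection, and an InfoNCE-style symmetric loss as the contrastive alignment term. All names, dimensions, and the specific loss form are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch (not the authors' released code) of the two-stage pipeline the
# abstract describes, assuming pre-computed embeddings: (1) pick the item images
# most relevant to a user's interest, (2) train the explanation generator with an
# InfoNCE-style contrastive term so the text stays aligned with the chosen images.
import torch
import torch.nn.functional as F


def select_personalized_images(user_emb: torch.Tensor,
                               image_embs: torch.Tensor,
                               k: int = 3) -> torch.Tensor:
    """Return indices of the k item images most similar to the user's interest.

    user_emb:   (d,)   embedding of the user's historical reviews/interests.
    image_embs: (n, d) embeddings of the candidate images of the recommended item.
    """
    sims = F.cosine_similarity(image_embs, user_emb.unsqueeze(0), dim=-1)  # (n,)
    return sims.topk(min(k, image_embs.size(0))).indices


def contrastive_alignment_loss(text_embs: torch.Tensor,
                               image_embs: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: each generated explanation should match its own
    image set (positive) rather than the image sets of other examples in the
    batch (in-batch negatives). Both inputs are (batch, d).
    """
    text_embs = F.normalize(text_embs, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    logits = text_embs @ image_embs.t() / temperature          # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over text->image and image->text directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    torch.manual_seed(0)
    user = torch.randn(512)              # hypothetical user-interest embedding
    item_images = torch.randn(8, 512)    # hypothetical embeddings of item photos
    picked = select_personalized_images(user, item_images, k=3)
    print("selected image indices:", picked.tolist())

    # Dummy batch of explanation / image-set embeddings for the loss sketch.
    text_batch, image_batch = torch.randn(4, 512), torch.randn(4, 512)
    print("contrastive loss:", contrastive_alignment_loss(text_batch, image_batch).item())
```

In this reading, the contrastive term encourages diversity and visual grounding by pushing each explanation toward its own selected images and away from other users' image sets; the abstract does not specify the exact loss, so the symmetric InfoNCE form here is an assumption.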