Text-based explainable recommendation aims to generate natural-language explanations that justify item recommendations, thereby improving user trust and system transparency. Although recent advances leverage LLMs to produce fluent outputs, a critical question remains underexplored: are these explanations factually consistent with the available evidence? We introduce a comprehensive framework for evaluating the factual consistency of text-based explainable recommenders. We design a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews, thereby constructing a ground truth that isolates their factual content. Applying this pipeline to five categories from the Amazon Reviews dataset, we create augmented benchmarks for fine-grained evaluation of explanation quality. We further propose statement-level alignment metrics that combine LLM- and NLI-based approaches to assess both the factual consistency and the relevance of generated explanations. Through extensive experiments on six state-of-the-art explainable recommendation models, we uncover a critical gap: while the models achieve high semantic similarity scores (BERTScore F1: 0.81-0.90), all of our factuality metrics reveal alarmingly low performance (LLM-based statement-level precision: 4.38%-32.88%). These findings underscore the need for factuality-aware evaluation in explainable recommendation and provide a foundation for developing more trustworthy explanation systems.
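To make the statement-level alignment idea concrete, below is a minimal sketch of how an NLI-based statement-level precision could be computed with an off-the-shelf entailment model. The model choice (roberta-large-mnli), the entailment threshold, and the function names are illustrative assumptions under one plausible reading of the metric, not the paper's actual implementation.

```python
# Minimal sketch of an NLI-based statement-level precision metric (an assumed
# reading of the abstract's metric, not the paper's exact implementation).
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL = "roberta-large-mnli"  # illustrative choice of NLI model
tok = AutoTokenizer.from_pretrained(MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def entailed(premise: str, hypothesis: str, threshold: float = 0.5) -> bool:
    """Return True if the premise entails the hypothesis above the threshold."""
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1).squeeze()
    return probs[2].item() >= threshold  # index 2 = ENTAILMENT for this model

def statement_precision(generated_statements, evidence_statements) -> float:
    """Fraction of generated atomic statements supported by any evidence statement."""
    if not generated_statements:
        return 0.0
    supported = sum(
        any(entailed(ev, gen) for ev in evidence_statements)
        for gen in generated_statements
    )
    return supported / len(generated_statements)

# Hypothetical usage: atomic statements assumed to be extracted beforehand by
# the LLM prompting pipeline described in the abstract.
evidence = ["The battery lasts about two days.", "The screen is bright outdoors."]
generated = ["The battery life is long.", "It comes with a leather case."]
print(statement_precision(generated, evidence))  # e.g. 0.5
```

A recall-style counterpart (the fraction of ground-truth evidence statements entailed by the generated explanation) would follow the same pattern with the roles of the two statement sets swapped.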