Accurate recommendation and reliable explanation are two key issues for modern recommender systems. However, most recommendation benchmarks concern only the prediction of user-item ratings and omit the underlying causes behind those ratings. For example, the widely used Yahoo!R3 dataset contains little information on the causes of its user-item ratings. One solution would be to conduct surveys that ask users to provide such information. In practice, however, user surveys can hardly avoid compliance issues and sparse responses, which greatly hinder the exploration of causality-based recommendation. To better support studies of causal inference and explanation in recommender systems, we propose a novel semi-synthetic data generation framework in which causal graphical models with missingness describe the causal mechanisms of practical recommendation scenarios. To illustrate the use of our framework, we construct a semi-synthetic dataset with Causal Tags And Ratings (CTAR), based on movies together with their descriptive tags and rating information collected from a famous movie rating website. Using the collected data and the causal graph, user-item ratings and their corresponding user-item tags are generated automatically, so that the selected tags provide the reasons why a user rates an item as they do. Descriptive statistics and baseline results for the CTAR dataset are also reported. The proposed data generation framework is not limited to recommendation, and the released APIs can be used to generate customized datasets for other research tasks.
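The generation idea described above can be pictured with a minimal sketch. It is not the released CTAR API: the function name, the three example tags, the toy causal graph (user preference -> selected tags -> rating -> observation), and every probability below are assumptions chosen only for illustration.

```python
import random


def generate_semi_synthetic(num_users=5, num_items=4,
                            tags=("funny", "dark", "romantic"), seed=0):
    """Toy generator: tags cause ratings, and ratings drive missingness.

    All names and numbers here are hypothetical, not taken from CTAR.
    """
    rng = random.Random(seed)

    # Latent user preference for each tag (synthetic part, assumed uniform).
    user_pref = {u: {t: rng.random() for t in tags} for u in range(num_users)}

    # Tags attached to each item. In a semi-synthetic setting these would
    # come from collected real data; here they are sampled as a stand-in.
    item_tags = {}
    for i in range(num_items):
        chosen = [t for t in tags if rng.random() < 0.6]
        item_tags[i] = chosen or [rng.choice(tags)]  # keep at least one tag

    records = []
    for u in range(num_users):
        for i in range(num_items):
            # Cause: the user selects the item tags they care about.
            selected = [t for t in item_tags[i]
                        if rng.random() < user_pref[u][t]]
            # Effect: the rating grows with the number of selected tags,
            # plus a small random bump, capped at 5.
            bump = 1 if rng.random() < 0.3 else 0
            rating = min(5, 1 + len(selected) + bump)
            # Missingness: higher ratings are more likely to be observed.
            observed = rng.random() < 0.2 + 0.15 * rating
            records.append({
                "user": u,
                "item": i,
                "selected_tags": selected,
                "rating": rating,
                "observed": observed,
            })
    return records


if __name__ == "__main__":
    for row in generate_semi_synthetic()[:3]:
        print(row)
```

Because every record keeps both the selected tags and the rating, a model's explanation can be checked against the tags that actually produced the rating, and the `observed` flag allows training on the incomplete data while evaluating on the full table.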