Individual Treatment Effect (ITE) prediction is an important area of machine learning research that aims to explain and estimate the causal impact of an action at the level of the individual. It is a problem of growing interest in application areas such as healthcare, online advertising, and socioeconomics. To foster research on this topic, we release a publicly available collection of 13.9 million samples collected from several randomized controlled trials, scaling up previously available datasets by a factor of 210. We provide details on the data collection and perform sanity checks to validate the use of this data for causal inference tasks. First, we formalize the task of uplift modeling (UM) that can be performed with this data, along with the relevant evaluation metrics. Then, we propose synthetic response surfaces and heterogeneous treatment assignment, providing a general setup for ITE prediction. Finally, we report experiments that validate key characteristics of the dataset, leveraging its size to evaluate and compare, with high statistical significance, a selection of baseline UM and ITE prediction methods.
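As a rough illustration of the quantity that uplift modeling targets, the sketch below estimates the average uplift from randomized data as the difference in mean outcome between treated and control units. It is a minimal example under stated assumptions, not the evaluation protocol of this work: the function name `average_uplift` and the toy arrays are hypothetical, and the toy values are made up rather than drawn from the released dataset.

```python
from typing import Sequence


def average_uplift(treatment: Sequence[int], outcome: Sequence[int]) -> float:
    """Difference in mean outcome between treated (1) and control (0) units.

    Under randomized treatment assignment, this difference is an unbiased
    estimate of the average treatment effect.
    """
    if len(treatment) != len(outcome):
        raise ValueError("treatment and outcome must have the same length")
    treated = [y for t, y in zip(treatment, outcome) if t == 1]
    control = [y for t, y in zip(treatment, outcome) if t == 0]
    if not treated or not control:
        raise ValueError("both groups must be non-empty")
    return sum(treated) / len(treated) - sum(control) / len(control)


# Toy data, invented for illustration only (hypothetical values).
toy_treatment = [1, 1, 1, 1, 0, 0, 0, 0]
toy_outcome = [1, 1, 0, 1, 0, 1, 0, 0]
toy_uplift = average_uplift(toy_treatment, toy_outcome)
```

An uplift model refines this population-level estimate by predicting the same difference conditionally on the features of each individual.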