The machine learning toolbox for estimation of heterogeneous treatment effects from observational data is expanding rapidly, yet many of its algorithms have been evaluated only on a very limited set of semi-synthetic benchmark datasets. In this paper, we show that even in arguably the simplest setting -- estimation under ignorability assumptions -- the results of such empirical evaluations can be misleading if (i) the assumptions underlying the data-generating mechanisms in benchmark datasets and (ii) their interplay with baseline algorithms are inadequately discussed. We examine in detail two popular machine learning benchmark datasets for the evaluation of heterogeneous treatment effect estimators -- the IHDP and ACIC2016 datasets. We identify problems with their current use and highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others -- a fact that is rarely acknowledged but of immense relevance for the interpretation of empirical results. We close by discussing implications and possible next steps.