The performance of most causal effect estimators relies on accurate predictions of high-dimensional non-linear functions of the observed data. The remarkable flexibility of modern Machine Learning (ML) methods is well suited to this task. However, data-driven hyperparameter tuning of ML methods requires effective model evaluation to avoid large errors in causal estimates, a task made more challenging because causal inference involves unavailable counterfactuals. Multiple performance-validation metrics have recently been proposed, so practitioners now face complex decisions not only about which causal estimators, ML learners, and hyperparameters to choose, but also about which evaluation metric to use. Motivated by the lack of clear recommendations, this paper investigates the interplay between these four aspects of model evaluation for causal effect estimation. We develop a comprehensive experimental setup that spans many commonly used causal estimators, ML methods, and evaluation approaches, and apply it to four well-known causal inference benchmark datasets. Our results suggest that optimal hyperparameter tuning of ML learners is sufficient to reach state-of-the-art performance in effect estimation, regardless of the choice of estimator or learner. We conclude that most causal estimators are roughly equivalent in performance if tuned thoroughly enough. We also find that hyperparameter tuning and model evaluation matter far more than the choice of causal estimator or ML method. Finally, given the significant gap we find between the estimation performance delivered by popular evaluation metrics and that of optimal model selection, we call for more research into causal model evaluation to unlock the optimal performance not currently being delivered even by state-of-the-art procedures.
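To make the four interacting choices concrete (causal estimator, ML learner, hyperparameters, and evaluation metric), the following minimal sketch shows hyperparameter selection for a CATE estimator driven by a validation metric. It is not the paper's experimental code: the T-learner, the gradient-boosting learner, the toy data, and the plug-in PEHE proxy (the helpers `t_learner_cate` and `plugin_pehe_proxy`) are illustrative assumptions, built only from standard scikit-learn and NumPy APIs.

```python
# Sketch of metric-driven model selection for causal effect estimation.
# Illustrative only: estimator, learner, grid, and metric are assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def t_learner_cate(X, t, y, X_eval, **hp):
    """T-learner: fit one outcome model per treatment arm and return
    estimated conditional average treatment effects (CATEs)."""
    m1 = GradientBoostingRegressor(**hp).fit(X[t == 1], y[t == 1])
    m0 = GradientBoostingRegressor(**hp).fit(X[t == 0], y[t == 0])
    return m1.predict(X_eval) - m0.predict(X_eval)

def plugin_pehe_proxy(X_tr, t_tr, y_tr, X_val, cate_val):
    """Plug-in metric: score candidate CATEs against effects from an
    auxiliary model fit on the training split. Counterfactuals are
    unobserved, so this is only a proxy for the true PEHE."""
    tau_plugin = t_learner_cate(X_tr, t_tr, y_tr, X_val, max_depth=3)
    return np.mean((cate_val - tau_plugin) ** 2)

# Toy data: the treatment effect grows with the first covariate.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
t = rng.binomial(1, 0.5, size=2000)
y = X[:, 0] + t * (1.0 + X[:, 0]) + rng.normal(scale=0.5, size=2000)

X_tr, X_val, t_tr, t_val, y_tr, y_val = train_test_split(
    X, t, y, test_size=0.3, random_state=0)

# Hyperparameter tuning driven entirely by the (proxy) evaluation metric.
grid = [{"max_depth": d, "n_estimators": n}
        for d in (2, 4) for n in (50, 200)]
scores = []
for hp in grid:
    cate_val = t_learner_cate(X_tr, t_tr, y_tr, X_val, **hp)
    scores.append(plugin_pehe_proxy(X_tr, t_tr, y_tr, X_val, cate_val))
best = grid[int(np.argmin(scores))]
print("selected hyperparameters:", best)
```

Note how the selected model is only as good as the proxy metric scoring it: since the true effects are unavailable, a weak metric can favor hyperparameters far from the truly optimal ones, which is exactly the evaluation gap the abstract highlights.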