Personalized treatment effect estimates are often of interest in high-stakes applications; thus, before deploying a model estimating such effects in practice, one needs to be sure that the best candidate for this task was chosen from the ever-growing machine learning toolbox. Unfortunately, due to the absence of counterfactual information in practice, it is usually not possible to rely on standard validation metrics for doing so, leading to a well-known model selection dilemma in the treatment effect estimation literature. While some solutions have recently been investigated, systematic understanding of the strengths and weaknesses of different model selection criteria is still lacking. In this paper, instead of attempting to declare a global `winner', we therefore empirically investigate the success and failure modes of different selection criteria. We highlight that there is a complex interplay between selection strategies, candidate estimators, and the data-generating process (DGP) used for testing, and we provide insights into the relative (dis)advantages of different criteria alongside desiderata for the design of further illuminating empirical studies in this context.
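To make the model selection dilemma concrete, below is a minimal sketch, not taken from the paper: the DGP, the two T-learner candidates, and the plug-in surrogate criterion are illustrative assumptions. It contrasts the oracle PEHE, which requires the true treatment effects and is therefore infeasible outside simulation, with a feasible criterion computed from factual data alone; the two need not rank candidates the same way, which is exactly the dilemma.

```python
# Illustrative sketch of the model selection dilemma (assumed setup,
# not the paper's exact experiments).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d))
tau = X[:, 0] ** 2                         # true CATE (known only in simulation)
W = rng.binomial(1, 0.5, size=n)           # randomized treatment assignment
Y = X[:, 1] + W * tau + rng.normal(scale=0.5, size=n)  # factual outcomes

train = np.arange(n // 2)
val = np.arange(n // 2, n)

def t_learner(make_base, idx):
    """T-learner: fit one outcome model per arm on rows idx; CATE = difference."""
    m1 = make_base().fit(X[idx][W[idx] == 1], Y[idx][W[idx] == 1])
    m0 = make_base().fit(X[idx][W[idx] == 0], Y[idx][W[idx] == 0])
    return lambda x: m1.predict(x) - m0.predict(x)

# Two candidate estimators from the "toolbox", fitted on the training split.
candidates = {
    "T-linear": t_learner(LinearRegression, train),
    "T-forest": t_learner(
        lambda: RandomForestRegressor(n_estimators=200, random_state=0), train),
}

# Feasible plug-in criterion: an auxiliary CATE estimate fitted on the
# validation split serves as noisy pseudo-ground-truth (one family of
# surrogate criteria; many variants exist).
tau_plugin = t_learner(
    lambda: RandomForestRegressor(n_estimators=200, random_state=1), val)

for name, cate in candidates.items():
    pred = cate(X[val])
    oracle = np.sqrt(np.mean((pred - tau[val]) ** 2))              # infeasible PEHE
    surrogate = np.sqrt(np.mean((pred - tau_plugin(X[val])) ** 2))  # computable
    print(f"{name}: oracle PEHE={oracle:.3f}  plug-in criterion={surrogate:.3f}")
```

Because the surrogate's pseudo-ground-truth is itself an estimate, its ranking of candidates can disagree with the oracle's, and how often it does depends on the DGP and on the candidate pool, which is the interplay the paper studies.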