In recent years, the increasing availability of individual-level data has led to numerous applications of individualized (or personalized) treatment rules (ITRs). Policy makers often wish to empirically evaluate ITRs and compare their relative performance before implementing them in a target population. We propose a new evaluation metric, the population average prescriptive effect (PAPE). The PAPE compares the performance of an ITR with that of a non-individualized treatment rule that randomly treats the same proportion of units. Averaging the PAPE over a range of budget constraints yields our second evaluation metric, the area under the prescriptive effect curve (AUPEC). The AUPEC serves as an overall performance measure for evaluation, much as the area under the receiver operating characteristic curve (AUROC) does for classification. We use Neyman's repeated sampling framework to estimate the PAPE and AUPEC and derive their exact finite-sample variances based on random sampling of units and random assignment of treatment. We also extend our analytical framework to a common evaluation setting in which the same experimental data are used both to estimate and to evaluate ITRs. In this case, our variance calculation incorporates the additional uncertainty due to the random splits of data used for cross-validation. Unlike some existing methods, the proposed methodology does not require modeling assumptions, asymptotic approximations, or resampling methods. As a result, it is applicable to any ITR, including those based on complex machine learning algorithms. An open-source software package is available for implementing the proposed methodology.
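To make the comparison behind the PAPE concrete, the following is a minimal Python sketch of a simple plug-in estimate from a completely randomized experiment. It is an illustration under stated assumptions, not the estimator or the API of the software package mentioned above: the function name `pape_plugin` and the arrays `y` (observed outcomes), `t` (binary treatment assignment), and `f` (binary ITR recommendation) are hypothetical, and the sketch omits any finite-sample correction and the variance calculations described in the abstract.

```python
import numpy as np


def pape_plugin(y, t, f):
    """Plug-in difference between the estimated average outcome under an ITR
    and under a rule that randomly treats the same proportion of units.

    y : observed outcomes
    t : 1 if the unit was randomly assigned to treatment, else 0
    f : 1 if the ITR would treat the unit, else 0
    (All names are hypothetical; this is an illustrative sketch.)
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=int)
    f = np.asarray(f, dtype=int)
    if not (y.shape == t.shape == f.shape):
        raise ValueError("y, t, and f must have the same shape")

    n1 = t.sum()        # number of treated units
    n0 = (1 - t).sum()  # number of control units
    if n1 == 0 or n0 == 0:
        raise ValueError("need at least one treated and one control unit")

    p_f = f.mean()      # proportion of units the ITR would treat

    # Average outcome if everyone followed the ITR: treated units the rule
    # would treat, plus control units the rule would leave untreated.
    value_itr = (y * t * f).sum() / n1 + (y * (1 - t) * (1 - f)).sum() / n0

    # Average outcome if the same proportion p_f were treated at random.
    value_random = (p_f * (y * t).sum() / n1
                    + (1 - p_f) * (y * (1 - t)).sum() / n0)

    return value_itr - value_random
```

A rule that treats every unit coincides with its random benchmark, so the sketch returns zero in that case; a positive value suggests that the ITR outperforms random treatment at the same treatment rate.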