The beneficial effects of treatments vary across individuals in most studies. This treatment heterogeneity motivates practitioners to search for the optimal policy based on individual characteristics. A long-standing common practice in policy learning has been to estimate and maximize the value function using weighting techniques. Matching is widely used in many applied disciplines to infer causal effects; it is intuitively appealing because the observed covariates are directly balanced across treatment groups. Nevertheless, matching has rarely been explored in policy learning. In this work, we propose a matching-based policy learning framework. We adapt standard and bias-corrected matching methods to estimate an alternative form of the value function: the advantage function, which can be interpreted as the expected improvement achieved by implementing a given policy relative to the equiprobable random policy. We then learn the optimal policy over a restricted policy class by maximizing the matching estimator of the advantage function. We derive a non-asymptotic high-probability bound for the regret of the learned policy and show that the learned policy is nearly rate-optimal. Extensive simulation studies and a real data application demonstrate the competitive finite-sample performance of the proposed method relative to weighting-based and outcome-modeling-based learning methods.
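As a point of reference, and under notation not given in the abstract (binary treatment $A \in \{-1, +1\}$, potential outcomes $Y(a)$, covariates $X$, and a policy $d(\cdot)$ mapping covariates to treatments are assumptions introduced here), the advantage function described above could be written as
\[
\mathcal{A}(d) \;=\; \mathbb{E}\bigl[Y\bigl(d(X)\bigr)\bigr] \;-\; \tfrac{1}{2}\,\mathbb{E}\bigl[Y(+1) + Y(-1)\bigr],
\]
i.e., the mean outcome under policy $d$ minus the mean outcome under the equiprobable random policy. In this sketch, the learned policy would be $\hat d = \arg\max_{d \in \mathcal{D}} \widehat{\mathcal{A}}(d)$ over a restricted class $\mathcal{D}$, with $\widehat{\mathcal{A}}$ the matching-based estimator; the exact formulation is specified in the paper itself.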