Treatment heterogeneity is ubiquitous in many application areas, motivating practitioners to search for the optimal policy that maximizes the expected outcome based on individualized characteristics. However, most existing policy learning methods rely on weighting-based approaches, which can be highly unstable in observational studies. To enhance the robustness of the learned policy, we propose a matching-based estimator of the policy improvement over a randomized baseline. After correcting its conditional bias, we learn the optimal policy by maximizing this estimate over a policy class. We derive a non-asymptotic high-probability bound for the regret of the learned policy and show that the convergence rate is almost $1/\sqrt{n}$. The competitive finite-sample performance of the proposed method is demonstrated in extensive simulation studies and a real-data application.
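To make the high-level recipe concrete, the following is a minimal sketch, not the paper's estimator: it uses 1-nearest-neighbor matching to impute counterfactual outcomes, scores a candidate policy against a uniformly randomized baseline, and grid-searches a linear-threshold policy class. All function names, the matching rule, the policy class, and the simulated data are illustrative assumptions; the bias correction and theoretical guarantees of the paper are omitted.

```python
# Illustrative sketch only (assumed design, not the authors' method):
# impute Y(0), Y(1) by 1-NN matching on covariates, then estimate the
# improvement of a policy over a p-randomized baseline.
import numpy as np

def match_impute(X, A, Y):
    """Impute both potential outcomes for every unit via 1-NN matching."""
    Y_hat = np.zeros((len(Y), 2))
    for a in (0, 1):
        idx = np.where(A == a)[0]  # units that actually received arm a
        for i in range(len(Y)):
            if A[i] == a:
                Y_hat[i, a] = Y[i]  # observed outcome, no imputation needed
            else:
                # nearest neighbor among opposite-arm units as surrogate
                j = idx[np.argmin(np.linalg.norm(X[idx] - X[i], axis=1))]
                Y_hat[i, a] = Y[j]
    return Y_hat

def improvement(pi, Y_hat, p=0.5):
    """Estimated value of policy pi minus a p-randomized baseline."""
    v_pi = Y_hat[np.arange(len(pi)), pi].mean()
    v_base = (p * Y_hat[:, 1] + (1 - p) * Y_hat[:, 0]).mean()
    return v_pi - v_base

# Simulated observational data with a heterogeneous treatment effect.
rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))
A = rng.integers(0, 2, size=n)
Y = X[:, 0] * (2 * A - 1) + rng.normal(scale=0.5, size=n)
Y_hat = match_impute(X, A, Y)

# Maximize the estimated improvement over a linear-threshold class:
# treat unit i iff X[i] @ w > 0, searching random directions w.
best_val, best_w = max(
    ((improvement((X @ w > 0).astype(int), Y_hat), w)
     for w in rng.normal(size=(50, d))),
    key=lambda t: t[0],
)
print(f"best estimated improvement over baseline: {best_val:.3f}")
```

In this toy setup the optimal rule treats exactly when $x_1 > 0$, so the search should recover a direction close to the first coordinate axis; the paper's contribution is the conditional bias correction of such matching-based estimates and the regret guarantee for the resulting learned policy.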