Importance sampling (IS) is often used to perform off-policy policy evaluation but is prone to several issues, especially when the behavior policy is unknown and must be estimated from data. Significant differences between the target and behavior policies can result in uncertain value estimates due to, for example, high variance and non-evaluated actions. If the behavior policy is estimated using black-box models, it can be hard to diagnose potential problems and to determine for which inputs the policies differ in their suggested actions and resulting values. To address this, we propose estimating the behavior policy for IS using prototype learning. We apply this approach in the evaluation of policies for sepsis treatment, demonstrating how the prototypes give a condensed summary of differences between the target and behavior policies while retaining an accuracy comparable to baseline estimators. We also describe estimated values in terms of the prototypes to better understand which parts of the target policies have the most impact on the estimates. Using a simulator, we study the bias resulting from restricting models to use prototypes.
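As a rough illustration of the idea, the sketch below estimates a behavior policy from prototypes and plugs it into a weighted (self-normalized) IS estimator. It is not the model used in this work, and several parts are assumptions made for the example: the prototypes are taken as given rather than learned, each state is assigned to its nearest prototype by Euclidean distance, action probabilities are Laplace-smoothed frequencies within each prototype, returns are undiscounted, and the function names `fit_prototype_policy` and `weighted_is_value` are hypothetical.

```python
import numpy as np


def fit_prototype_policy(states, actions, prototypes, n_actions, alpha=1.0):
    """Estimate behavior-policy action probabilities per prototype.

    Assumption for this sketch: prototypes are given, and each state is
    assigned to its nearest prototype (Euclidean distance). The action
    distribution of a prototype is the Laplace-smoothed empirical frequency
    of the logged actions among the states assigned to it.
    """
    states = np.asarray(states, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    actions = np.asarray(actions, dtype=int)
    # Pairwise distances, shape (n_states, n_prototypes).
    dists = np.linalg.norm(states[:, None, :] - prototypes[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # Smoothed action counts per prototype.
    counts = np.full((len(prototypes), n_actions), alpha, dtype=float)
    np.add.at(counts, (assign, actions), 1.0)
    probs = counts / counts.sum(axis=1, keepdims=True)
    return probs, assign


def weighted_is_value(rewards, target_probs, behavior_probs, traj_ids):
    """Weighted IS estimate of the target policy's value.

    All arguments are per-step arrays. target_probs[t] and behavior_probs[t]
    are the probabilities each policy gives to the action logged at step t.
    Returns are undiscounted sums of rewards (an assumption of this sketch).
    The result is NaN if every trajectory weight is zero.
    """
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    ids, inv = np.unique(np.asarray(traj_ids), return_inverse=True)
    # Trajectory weight: product of per-step probability ratios.
    weights = np.ones(len(ids))
    np.multiply.at(weights, inv, ratios)
    # Trajectory return: sum of per-step rewards.
    returns = np.zeros(len(ids))
    np.add.at(returns, inv, rewards)
    return float(np.sum(weights * returns) / np.sum(weights))


# Toy usage: four one-dimensional states, two prototypes, two actions.
states = [[0.0], [0.1], [1.0], [0.9]]
actions = [0, 0, 1, 1]
probs, assign = fit_prototype_policy(states, actions, [[0.0], [1.0]], n_actions=2)
# Estimated behavior probability of each logged action.
behavior_probs = probs[assign, actions]
```

Because every state mapped to the same prototype shares one estimated action distribution, the rows of `probs` can be inspected directly and compared with the target policy's actions at each prototype, which is the kind of condensed summary the abstract describes. The coarseness of that assignment is also where bias from restricting the model to prototypes can enter.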