Reinforcement Learning aims to identify and evaluate efficient control policies from data. In many real-world applications, the learner is not allowed to experiment and cannot gather data in an online manner (as is the case when experimentation is expensive, risky, or unethical). For such applications, the reward of a given policy (the target policy) must be estimated using historical data gathered under a different policy (the behavior policy). Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees. We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty. The main challenge in OPE stems from the distribution shift induced by the discrepancy between the target and the behavior policies. We propose and empirically evaluate different ways to deal with this shift. Some of these methods yield conformalized intervals that are shorter than those of existing approaches, while maintaining the same certainty level.
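To make the role of the distribution shift concrete, the display below sketches how an importance-weighted split-conformal interval can be constructed when the likelihood ratio between target and behavior distributions is available. This follows the standard covariate-shift weighting of conformal prediction (Tibshirani et al., 2019) and is an illustrative sketch rather than the exact procedure proposed in this paper; the weight function $w$, the fitted reward model $\hat{\mu}$, and the scores $s_i$ are hypothetical notation.

% Illustrative sketch only: weighted split conformal prediction under the
% policy-induced shift. Calibration pairs (x_i, y_i), i = 1, ..., n, are
% collected under the behavior policy; nonconformity scores are
% s_i = |y_i - \hat{\mu}(x_i)| for a reward model \hat{\mu} fit on separate data.
\[
  w(x) \;=\; \frac{\mathrm{d}P_{\mathrm{target}}}{\mathrm{d}P_{\mathrm{behavior}}}(x),
  \qquad
  p_i(x) \;=\; \frac{w(x_i)}{\sum_{j=1}^{n} w(x_j) + w(x)},
  \qquad
  p_{\infty}(x) \;=\; \frac{w(x)}{\sum_{j=1}^{n} w(x_j) + w(x)},
\]
\[
  \widehat{C}_{1-\alpha}(x) \;=\;
  \Bigl\{\, y \;:\; |y - \hat{\mu}(x)| \;\le\;
  Q_{1-\alpha}\Bigl(\textstyle\sum_{i=1}^{n} p_i(x)\,\delta_{s_i} \;+\; p_{\infty}(x)\,\delta_{+\infty}\Bigr) \Bigr\},
\]
% where Q_{1-\alpha} denotes the (1-\alpha)-quantile of the weighted empirical
% distribution of the scores. If w is the exact likelihood ratio (e.g., the
% ratio of target to behavior action probabilities in a one-step setting),
% the resulting interval covers the target-policy reward with probability
% at least 1 - \alpha.

Approximating or truncating the weights $w$, or replacing them with learned estimates, trades tightness of the interval against robustness; the empirical comparison of such choices is the subject of the shift-handling strategies discussed above.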