Reinforcement learning (RL) has been extensively researched for enhancing human-environment interactions in various human-centric tasks, including e-learning and healthcare. Since deploying and evaluating policies online is high-stakes in such tasks, off-policy evaluation (OPE) is crucial for inducing effective policies. In human-centric environments, however, OPE is challenging because the underlying state is often unobservable, while only aggregate rewards can be observed (e.g., students' test scores or whether a patient is eventually discharged from the hospital). In this work, we propose a human-centric OPE (HOPE) method to handle partial observability and aggregated rewards in such environments. Specifically, we reconstruct immediate rewards from the aggregated rewards while accounting for partial observability, in order to estimate expected total returns. We provide a theoretical bound for the proposed method, and we conduct extensive experiments on real-world human-centric tasks, including sepsis treatment and an intelligent tutoring system. Our approach reliably predicts the returns of different policies and outperforms state-of-the-art benchmarks under both standard validation methods and human-centric significance tests.
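To make the core idea concrete, the sketch below illustrates one generic way to reconstruct per-step rewards from trajectory-level (aggregated) rewards and then plug them into a standard OPE estimator. This is a minimal toy example, not the paper's HOPE estimator: the synthetic data, the linear reward-redistribution model, the `featurize` helper, and the `behavior_prob`/`target_prob` policies are all hypothetical assumptions introduced for illustration, and per-decision importance sampling stands in for whatever estimator one would actually use on the reconstructed rewards.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Synthetic logged data (placeholders; real data would come from the logging policy) ---
n_traj, horizon, obs_dim, n_actions = 200, 10, 4, 3

def featurize(obs, act):
    """Concatenate an observation with a one-hot encoding of the action."""
    one_hot = np.zeros(n_actions)
    one_hot[act] = 1.0
    return np.concatenate([obs, one_hot])

trajectories = []
for _ in range(n_traj):
    obs = rng.normal(size=(horizon, obs_dim))           # partial observations only
    acts = rng.integers(0, n_actions, size=horizon)
    # Only the aggregated (trajectory-level) reward is observed by the estimator.
    hidden_step_rewards = obs[:, 0] + 0.5 * (acts == 1)
    agg_reward = hidden_step_rewards.sum()
    trajectories.append((obs, acts, agg_reward))

# --- Step 1: reconstruct immediate rewards from aggregated rewards ---
# Fit weights w so that sum_t phi(o_t, a_t) @ w matches the aggregated reward,
# i.e. a simple linear reward-redistribution model solved by least squares.
X = np.stack([sum(featurize(o, a) for o, a in zip(obs, acts))
              for obs, acts, _ in trajectories])
y = np.array([agg for _, _, agg in trajectories])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# --- Step 2: plug reconstructed rewards into a standard OPE estimator ---
# Per-decision importance sampling with hypothetical behavior/target policies.
def behavior_prob(obs, act):    # logging policy assumed uniform in this toy setup
    return 1.0 / n_actions

def target_prob(obs, act):      # evaluation policy: prefers action 1
    return 0.7 if act == 1 else 0.3 / (n_actions - 1)

estimates = []
for obs, acts, _ in trajectories:
    rho, value = 1.0, 0.0
    for t in range(horizon):
        rho *= target_prob(obs[t], acts[t]) / behavior_prob(obs[t], acts[t])
        r_hat = featurize(obs[t], acts[t]) @ w       # reconstructed immediate reward
        value += rho * r_hat
    estimates.append(value)

print("Estimated return of the target policy:", np.mean(estimates))
```

The design choice worth noting is the separation of concerns: reward reconstruction only has to explain the observed aggregate rewards from observation-action features, after which any off-the-shelf OPE estimator can consume the reconstructed per-step rewards.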