We introduce super reinforcement learning in the batch setting, which takes the observed action as an input for enhanced policy learning. In the presence of unmeasured confounders, the recommendations of human experts recorded in the observed data allow us to recover certain unobserved information. By incorporating this information into the policy search, the proposed super reinforcement learning yields a super-policy that is guaranteed to outperform both the standard optimal policy and the behavior policy (e.g., the expert's recommendation). Furthermore, to address the issue of unmeasured confounding in learning super-policies, we establish a number of non-parametric identification results. Finally, we develop two super-policy learning algorithms and derive their corresponding finite-sample regret guarantees.
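To illustrate the core idea, here is a minimal toy simulation (all names and the data-generating mechanism are hypothetical, not from the paper): an unmeasured confounder drives the reward, the expert's recorded action is correlated with that confounder, and so a policy that takes the expert's action as an extra input can outperform both the expert and the best policy that sees only the state.

```python
import random

random.seed(0)

def rollout(policy, n=50000):
    """Monte-Carlo value estimate of a policy mapping (state, expert_action) -> action."""
    total = 0.0
    for _ in range(n):
        u = random.random() < 0.5                      # unmeasured confounder
        s = random.random() < 0.5                      # observed state, independent of u
        a_b = u if random.random() < 0.9 else (not u)  # expert observes u and mostly acts on it
        a = policy(s, a_b)
        # reward: when s is True it depends on the confounder;
        # when s is False, the action True is always best
        r = (a == u) if s else a
        total += float(r)
    return total / n

behavior = lambda s, a_b: a_b                  # imitate the expert everywhere
standard = lambda s, a_b: True                 # best policy that ignores a_b
super_pi = lambda s, a_b: a_b if s else True   # use a_b only where it is informative

v_b, v_std, v_sup = rollout(behavior), rollout(standard), rollout(super_pi)
# expected values: behavior ~0.70, standard ~0.75, super-policy ~0.95
```

In this toy setting the super-policy dominates both baselines, mirroring the guarantee described in the abstract: the expert's action reveals the confounder where it matters (s is True), while the state-only optimal action is used elsewhere.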