In offline reinforcement learning (RL), an optimal policy is learned solely from a priori collected observational data. In such data, however, actions are often confounded by unobserved variables. Instrumental variables (IVs), in the context of RL, are variables whose influence on the state variables is mediated entirely through the action. When a valid instrument is available, we can recover the confounded transition dynamics from observational data. We study a confounded Markov decision process whose transition dynamics admit an additive nonlinear functional form. Using IVs, we derive a conditional moment restriction (CMR) through which the transition dynamics can be identified from observational data. We propose a provably efficient IV-aided Value Iteration (IVVI) algorithm based on a primal-dual reformulation of the CMR. To the best of our knowledge, this is the first provably efficient algorithm for instrument-aided offline RL.
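To make the identification idea concrete, here is a minimal sketch, assuming an additive transition model with confounded noise; the symbols $f$, $\varepsilon_t$, and $z_t$ are illustrative and need not match the paper's exact notation. Suppose
\[
s_{t+1} = f(s_t, a_t) + \varepsilon_t, \qquad \mathbb{E}[\varepsilon_t \mid s_t, a_t] \neq 0 \ \text{(confounding)},
\]
and suppose $z_t$ is a valid instrument, i.e., it affects $s_{t+1}$ only through $a_t$ and satisfies $\mathbb{E}[\varepsilon_t \mid s_t, z_t] = 0$. Then the conditional moment restriction
\[
\mathbb{E}\bigl[\, s_{t+1} - f(s_t, a_t) \,\bigm|\, s_t, z_t \,\bigr] = 0
\]
identifies $f$, and hence the transition dynamics, from observational data alone.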