We study offline reinforcement learning (RL) in partially observable Markov decision processes (POMDPs). In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy that may depend on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which renders existing offline RL algorithms inapplicable. To this end, we propose the \underline{P}roxy variable \underline{P}essimistic \underline{P}olicy \underline{O}ptimization (\texttt{P3O}) algorithm, which addresses both the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of \texttt{P3O} is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that \texttt{P3O} achieves an $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To the best of our knowledge, \texttt{P3O} is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.