We study offline reinforcement learning (RL) in partially observable Markov decision processes (POMDPs). In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy that may depend on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the \underline{P}roxy variable \underline{P}essimistic \underline{P}olicy \underline{O}ptimization (\texttt{P3O}) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of \texttt{P3O} is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that \texttt{P3O} achieves an $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To the best of our knowledge, \texttt{P3O} is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.