Offline reinforcement learning aims to find the optimal policy from a pre-collected dataset without active exploration. This problem is faced with major challenges, such as a limited amount of data and distribution shift. Existing studies employ the principle of pessimism in face of uncertainty, and penalize rewards for less visited state-action pairs. In this paper, we directly model the uncertainty in the transition kernel using an uncertainty set, and then employ the approach of distributionally robust optimization that optimizes the worst-case performance over the uncertainty set. We first design a Hoeffding-style uncertainty set, which guarantees that the true transition kernel lies in the uncertainty set with high probability. We theoretically prove that it achieves an $\epsilon$-accuracy with a sample complexity of $\mathcal{O}\left((1-\gamma)^{-4}\epsilon^{-2}SC^{\pi^*} \right)$, where $\gamma$ is the discount factor, $C^{\pi^*}$ is the single-policy concentrability for any comparator policy $\pi^*$, and $S$ is the number of states. We further design a Bernstein-style uncertainty set, which does not necessarily guarantee the true transition kernel lies in the uncertainty set. We show an improved and near-optimal sample complexity of $\mathcal{O}\left((1-\gamma)^{-3}\epsilon^{-2}\left(SC^{\pi^*}+(\mu_{\min})^{-1}\right) \right)$, where $\mu_{\min}$ denotes the minimal non-zero entry of the behavior distribution. In addition, the computational complexity of our algorithms is the same as one of the LCB-based methods in the literature. Our results demonstrate that distributionally robust optimization method can also efficiently solve offline reinforcement learning.
翻译:暂无翻译