We consider a reinforcement learning setting in which the deployment environment differs from the training environment. Applying a robust Markov decision process formulation, we extend the distributionally robust $Q$-learning framework studied in Liu et al. [2022]. Further, we improve the design and analysis of their multi-level Monte Carlo estimator. Assuming access to a simulator, we prove that the worst-case expected sample complexity of our algorithm to learn the optimal robust $Q$-function within an $\epsilon$ error in the sup norm is upper bounded by $\tilde O(|S||A|(1-\gamma)^{-5}\epsilon^{-2}p_{\wedge}^{-6}\delta^{-4})$, where $\gamma$ is the discount rate, $p_{\wedge}$ is the non-zero minimal support probability of the transition kernels, and $\delta$ is the uncertainty size. This is the first sample complexity result for the model-free robust RL problem. Simulation studies further validate our theoretical results.
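To make the setting concrete, the following is a minimal, hypothetical sketch of a distributionally robust $Q$-learning step under a KL-divergence uncertainty set of radius $\delta$, where the worst-case expectation is computed through its dual form $\sup_{\alpha>0}\{-\alpha\log\mathbb{E}_{P_0}[e^{-V/\alpha}]-\alpha\delta\}$ with a plain plug-in sample average over simulator draws. The function names (`robust_expectation_kl`, `robust_q_update`) and the plug-in averaging are illustrative assumptions, not the paper's multi-level Monte Carlo estimator or its exact algorithm.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def robust_expectation_kl(values, delta, alpha_max=100.0):
    """Plug-in estimate of inf_{P: KL(P||P0)<=delta} E_P[values] via the dual
    sup_{alpha>0} { -alpha * log E_{P0}[exp(-values/alpha)] - alpha * delta }."""
    values = np.asarray(values, dtype=float)
    m = values.min()

    def neg_dual(alpha):
        # shift by the minimum for a numerically stable log-mean-exp
        lme = np.log(np.mean(np.exp(-(values - m) / alpha)))
        return -(m - alpha * lme - alpha * delta)

    res = minimize_scalar(neg_dual, bounds=(1e-6, alpha_max), method="bounded")
    return -res.fun


def robust_q_update(Q, s, a, reward, next_states, gamma, delta, lr):
    """One robust Q-learning step at (s, a) using i.i.d. simulator samples
    next_states ~ P0(. | s, a)."""
    v_samples = Q[next_states].max(axis=1)            # V(s') = max_{a'} Q(s', a')
    target = reward + gamma * robust_expectation_kl(v_samples, delta)
    Q[s, a] += lr * (target - Q[s, a])
    return Q


# Toy usage on a random 5-state, 2-action problem (illustration only).
rng = np.random.default_rng(0)
Q = np.zeros((5, 2))
for _ in range(200):
    s, a = rng.integers(5), rng.integers(2)
    next_states = rng.integers(0, 5, size=32)         # simulator draws for (s, a)
    Q = robust_q_update(Q, s, a, reward=rng.random(), next_states=next_states,
                        gamma=0.9, delta=0.1, lr=0.1)
```

A plug-in average converges but does not by itself give an unbiased estimate of the robust Bellman operator; the multi-level Monte Carlo construction analyzed in the paper is what removes this bias and drives the stated sample complexity bound.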