The goal of robust reinforcement learning (RL) is to learn a policy that is robust against the uncertainty in model parameters. Parameter uncertainty commonly occurs in many real-world RL applications due to simulator modeling errors, changes in the real-world system dynamics over time, and adversarial disturbances. Robust RL is typically formulated as a max-min problem, where the objective is to learn the policy that maximizes the value against the worst possible models that lie in an uncertainty set. In this work, we propose a robust RL algorithm called Robust Fitted Q-Iteration (RFQI), which uses only an offline dataset to learn the optimal robust policy. Robust RL with offline data is significantly more challenging than its non-robust counterpart because of the minimization over all models present in the robust Bellman operator. This poses challenges in offline data collection, optimization over the models, and unbiased estimation. In this work, we propose a systematic approach to overcome these challenges, resulting in our RFQI algorithm. We prove that RFQI learns a near-optimal robust policy under standard assumptions and demonstrate its superior performance on standard benchmark problems.
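For concreteness, the max-min formulation and the robust Bellman operator mentioned above can be written out as follows. The notation here (uncertainty set $\mathcal{P}$, discount factor $\gamma$, robust value $V^{\pi}_{P}$, operator $T$) is introduced only for illustration and follows the standard robust MDP setup; it is a sketch of the usual formulation rather than the exact definitions used in this work:
\[
  \max_{\pi} \; \min_{P \in \mathcal{P}} \; V^{\pi}_{P}(s),
  \qquad
  V^{\pi}_{P}(s) = \mathbb{E}_{P,\pi}\!\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s\Big],
\]
where the corresponding robust Bellman (optimality) operator takes a minimum over the models in the uncertainty set, which is the source of the additional difficulty in the offline setting:
\[
  (TQ)(s,a) = r(s,a) + \gamma \min_{P \in \mathcal{P}(s,a)} \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[\max_{a'} Q(s',a')\big].
\]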