We propose a novel model-based offline Reinforcement Learning (RL) framework, called Adversarial Model for Offline Reinforcement Learning (ARMOR), which can robustly learn policies that improve upon an arbitrary reference policy regardless of data coverage. ARMOR optimizes policies for the worst-case performance relative to the reference policy by adversarially training a Markov decision process model. In theory, we prove that ARMOR, with a well-tuned hyperparameter, can compete with the best policy within data coverage when the reference policy is supported by the data. At the same time, ARMOR is robust to hyperparameter choices: the policy learned by ARMOR, with "any" admissible hyperparameter, never degrades the performance of the reference policy, even when the reference policy is not covered by the dataset. To validate these properties in practice, we design a scalable implementation of ARMOR which, through adversarial training, can optimize policies without using model ensembles, in contrast to typical model-based methods. We show that ARMOR performs competitively with state-of-the-art model-free and model-based offline RL algorithms and can robustly improve the reference policy across a wide range of hyperparameter choices.
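As a sketch of the idea in our own notation (an illustrative summary, not the paper's formal statement): writing $J_M(\pi)$ for the return of policy $\pi$ in model $M$, $\pi_{\mathrm{ref}}$ for the reference policy, and $\mathcal{M}_\alpha$ for a version space of MDP models consistent with the offline data (with $\alpha$ the hyperparameter controlling its size), the relative-pessimism principle described above can be summarized as
$$\hat{\pi} \;\in\; \operatorname*{arg\,max}_{\pi}\; \min_{M \in \mathcal{M}_\alpha} \big[\, J_M(\pi) - J_M(\pi_{\mathrm{ref}}) \,\big].$$
Since $\pi = \pi_{\mathrm{ref}}$ attains a value of zero in this objective, the learned policy cannot do worse than the reference policy under any model in the version space, which is the source of the robustness-to-hyperparameter property claimed above.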