We propose a new model-based offline RL framework, called Adversarial Models for Offline Reinforcement Learning (ARMOR), which can robustly learn policies to improve upon an arbitrary baseline policy regardless of data coverage. Based on the concept of relative pessimism, ARMOR is designed to optimize for the worst-case relative performance when facing uncertainty. In theory, we prove that the policy learned by ARMOR never degrades the performance of the baseline policy for any admissible hyperparameter, and that it competes with the best policy within data coverage when the hyperparameter is well tuned and the baseline policy is supported by the data. This robust policy improvement property makes ARMOR especially suitable for building real-world learning systems, because in practice ensuring no performance degradation is imperative before considering any benefit that learning can bring.
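To make the relative-pessimism idea concrete, one way to sketch the objective (the notation below is ours for illustration and not fixed by the text above) is as a max-min problem over a version space $\mathcal{M}$ of models consistent with the offline data, where $J_{M}(\pi)$ denotes the expected return of policy $\pi$ in model $M$ and $\pi_{\mathrm{ref}}$ is the baseline (reference) policy:

$$
\hat{\pi} \;\in\; \operatorname*{argmax}_{\pi \in \Pi}\; \min_{M \in \mathcal{M}}\; \big( J_{M}(\pi) - J_{M}(\pi_{\mathrm{ref}}) \big)
$$

Intuitively, since choosing $\pi = \pi_{\mathrm{ref}}$ makes the inner difference zero in every model, the maximizer is guaranteed a non-negative worst-case relative performance, which is the intuition behind the no-degradation guarantee stated above.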