Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment. The lack of environmental interactions, however, makes the policy training vulnerable to state-action pairs far from the training dataset and prone to missing rewarding actions. To train more effective agents, we propose a framework that supports learning a flexible yet well-regularized fully-implicit policy. We further propose a simple modification to the classical policy-matching methods for regularizing with respect to the dual form of the Jensen--Shannon divergence and the integral probability metrics. We theoretically show the correctness of the policy-matching approach, and the correctness and a good finite-sample property of our modification. An effective instantiation of our framework through the GAN structure is provided, together with techniques to explicitly smooth the state-action mapping for robust generalization beyond the static dataset. Extensive experiments and ablation studies on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
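To make the GAN-based instantiation concrete, the following is a minimal sketch, not the authors' exact implementation, of how a discriminator over state-action pairs can regularize a fully-implicit (noise-conditioned) policy via the dual form of the Jensen--Shannon divergence. The network sizes, the placeholder critic `q_value`, and the trade-off weight `alpha` are hypothetical choices for illustration only.

```python
# Hedged sketch: GAN-style policy-matching regularization for offline RL.
# The discriminator separates dataset actions from policy actions; its
# non-saturating loss is a surrogate for the Jensen--Shannon divergence dual.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, NOISE_DIM = 17, 6, 8  # hypothetical dimensions


class ImplicitPolicy(nn.Module):
    """Fully-implicit policy: deterministic map of (state, noise) -> action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + NOISE_DIM, 256), nn.ReLU(),
            nn.Linear(256, ACTION_DIM), nn.Tanh())

    def forward(self, state):
        noise = torch.randn(state.size(0), NOISE_DIM, device=state.device)
        return self.net(torch.cat([state, noise], dim=-1))


class Discriminator(nn.Module):
    """Scores (state, action) pairs; real = dataset, fake = policy samples."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def q_value(state, action):
    # Placeholder critic; the full algorithm uses a learned Q-network.
    return -(action ** 2).sum(dim=-1, keepdim=True)


policy, disc = ImplicitPolicy(), Discriminator()
opt_pi = torch.optim.Adam(policy.parameters(), lr=3e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()
alpha = 2.5  # hypothetical weight between Q-maximization and regularization

for _ in range(10):  # toy loop over random "dataset" batches
    s = torch.randn(64, STATE_DIM)
    a_data = torch.tanh(torch.randn(64, ACTION_DIM))

    # Discriminator step: distinguish dataset actions from policy actions.
    a_pi = policy(s).detach()
    d_loss = bce(disc(s, a_data), torch.ones(64, 1)) + \
             bce(disc(s, a_pi), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Policy step: maximize the critic value while fooling the discriminator
    # (non-saturating generator loss, the JS-divergence surrogate).
    a_pi = policy(s)
    pi_loss = -q_value(s, a_pi).mean() + \
              alpha * bce(disc(s, a_pi), torch.ones(64, 1))
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
```

In this sketch the policy is kept close to the behavior distribution only through the discriminator signal; the explicit state-action smoothing mentioned in the abstract (e.g., perturbing states before sampling actions) would be layered on top of this basic loop.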