In offline RL, constraining the learned policy to remain close to the data is essential to prevent the policy from outputting out-of-distribution (OOD) actions with erroneously overestimated values. In principle, generative adversarial networks (GANs) can provide an elegant solution to this problem, with the discriminator directly providing a probability that quantifies distributional shift. In practice, however, GAN-based offline RL methods have not performed as well as alternative approaches, perhaps because the generator is trained both to fool the discriminator and to maximize return -- two objectives that can be at odds with each other. In this paper, we show that this conflict of objectives can be resolved by training two generators: one that maximizes return, and another that captures the ``remainder'' of the data distribution in the offline dataset, such that the mixture of the two is close to the behavior policy. We show that having two generators not only enables an effective GAN-based offline RL method, but also approximates a support constraint, under which the policy does not need to match the entire data distribution, but only the slice of the data that leads to high long-term performance. We name our method DASCO, for Dual-Generator Adversarial Support Constrained Offline RL. On benchmark tasks that require learning from sub-optimal data, DASCO significantly outperforms prior methods that enforce distribution constraints.
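To make the dual-generator idea concrete, the following is a minimal PyTorch-style sketch of one plausible set of losses. The networks `policy`, `aux_generator`, `discriminator`, and `q_net` are hypothetical placeholders, and the losses illustrate the description above (a return-maximizing generator plus an auxiliary generator whose mixture is pushed toward the behavior policy) rather than the exact DASCO objective.

```python
# Sketch only: assumes pre-built networks `policy`, `aux_generator`,
# `discriminator`, and `q_net`; not the authors' reference implementation.
import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, policy, aux_generator, states, dataset_actions):
    """Train D to separate dataset actions from the mixture of the two generators."""
    with torch.no_grad():
        pi_actions = policy(states)          # return-maximizing generator
        aux_actions = aux_generator(states)  # captures the "remainder" of the data
    real_logits = discriminator(states, dataset_actions)
    # The mixture of the two generators plays the role of the "fake" distribution.
    fake_logits = torch.cat([discriminator(states, pi_actions),
                             discriminator(states, aux_actions)], dim=0)
    real_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    fake_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return real_loss + fake_loss

def generator_losses(discriminator, policy, aux_generator, q_net, states):
    """The policy maximizes Q and fools the discriminator; the auxiliary
    generator only tries to fool the discriminator, covering the rest of the data."""
    pi_actions = policy(states)
    aux_actions = aux_generator(states)
    adv_pi = F.binary_cross_entropy_with_logits(
        discriminator(states, pi_actions),
        torch.ones_like(discriminator(states, pi_actions)))
    adv_aux = F.binary_cross_entropy_with_logits(
        discriminator(states, aux_actions),
        torch.ones_like(discriminator(states, aux_actions)))
    policy_loss = -q_net(states, pi_actions).mean() + adv_pi  # return + adversarial term
    aux_loss = adv_aux                                        # adversarial term only
    return policy_loss, aux_loss
```

Because the discriminator scores the mixture rather than the policy alone, the policy is free to concentrate on the high-return slice of the data while the auxiliary generator absorbs the remainder, which is how the adversarial game approximates a support constraint instead of a full distribution constraint.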