Offline reinforcement learning approaches can generally be divided into proximal and uncertainty-aware methods. In this work, we demonstrate the benefit of combining the two in a latent variational model. We learn a latent representation of states and actions and leverage its intrinsic Riemannian geometry to measure the distance of latent samples from the data. Our proposed metrics capture both the quality of out-of-distribution samples and the discrepancy between examples in the data. We integrate our metrics into a model-based offline optimization framework, in which proximity and uncertainty can be carefully controlled. We illustrate the geodesics on a simple grid-like environment, depicting its natural inherent topology. Finally, we analyze our approach and demonstrate improvements over contemporary offline RL benchmarks.
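To make the geometric idea concrete, the following is a minimal sketch (not the authors' implementation) of how a distance to the data could be approximated under the pullback metric induced by a variational decoder: the straight latent segment between a query code and a data code is discretized, and its Riemannian length is accumulated using G(z) = J(z)^T J(z), where J is the decoder Jacobian. The decoder architecture, dimensions, and discretization below are illustrative assumptions.

```python
# Sketch only: pullback-metric path length in a VAE latent space.
# Decoder, dimensions, and step count are assumptions for illustration.
import torch

latent_dim, obs_dim = 8, 32
decoder = torch.nn.Sequential(
    torch.nn.Linear(latent_dim, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, obs_dim),
)

def pullback_metric(z):
    """G(z) = J(z)^T J(z), with J the decoder Jacobian at z."""
    J = torch.autograd.functional.jacobian(decoder, z)  # shape (obs_dim, latent_dim)
    return J.T @ J

def path_length(z_start, z_end, steps=20):
    """Riemannian length of the straight latent segment from z_start to z_end,
    used as a crude proxy for the geodesic distance between latent codes."""
    dz = (z_end - z_start) / steps
    length = 0.0
    for t in torch.linspace(0.0, 1.0, steps + 1)[:-1]:
        z = z_start + t * (z_end - z_start)
        G = pullback_metric(z)
        length += torch.sqrt(dz @ G @ dz).item()
    return length

# Example: distance of a candidate latent sample to the latent code of a dataset point.
z_query, z_data = torch.randn(latent_dim), torch.randn(latent_dim)
print(path_length(z_query, z_data))
```

In this reading, large path lengths flag latent samples that lie far from the data manifold, which is the kind of signal a model-based offline optimizer can use to control proximity and uncertainty.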