Offline reinforcement learning has developed rapidly in recent years, but estimating the actual performance of offline policies remains a challenge. We propose a scoring metric for offline policies that correlates strongly with actual policy performance and can be used directly for offline policy optimization in a supervised manner. To achieve this, we leverage the contrastive learning framework to design a scoring metric that assigns high scores to policies that imitate actions yielding relatively high returns while avoiding actions yielding relatively low returns. Our experiments show that 1) our scoring metric ranks offline policies more accurately and 2) policies optimized with our metric achieve high performance on various offline reinforcement learning benchmarks. Notably, our algorithm requires far less network capacity for the policy network than other supervised learning-based methods and needs no additional networks such as a Q-network.
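The abstract describes the scoring idea only at a high level. As a minimal illustrative sketch (not the paper's actual metric), one score of this contrastive flavor could compare a policy's likelihood on high-return versus low-return actions from the offline dataset; the interface `policy_logprob(s, a)`, the quantile-based split, and all names below are assumptions introduced for illustration.

```python
import numpy as np

def contrastive_policy_score(policy_logprob, states, actions, returns, quantile=0.5):
    """Score a policy by contrasting its likelihood on high- vs. low-return actions.

    policy_logprob(state, action) -> log pi(action | state)   # assumed interface
    A higher score means the policy imitates actions with relatively high returns
    while assigning low probability to actions with relatively low returns.
    """
    threshold = np.quantile(returns, quantile)
    pos = returns >= threshold          # transitions with relatively high returns
    neg = ~pos                          # transitions with relatively low returns
    logp = np.array([policy_logprob(s, a) for s, a in zip(states, actions)])
    return logp[pos].mean() - logp[neg].mean()
```

Because such a score is a differentiable function of the policy's log-probabilities, maximizing it can in principle serve directly as a supervised training objective, which is the sense in which the abstract's metric is "directly used for offline policy optimization."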