Policy constraint methods for offline reinforcement learning (RL) typically utilize parameterization or regularization that constrains the policy to perform actions within the support set of the behavior policy. The elaborate designs of parameterization methods usually intrude into the policy networks, which may bring extra inference cost and cannot take full advantage of well-established online methods. Regularization methods reduce the divergence between the learned policy and the behavior policy, which may mismatch the inherent density-based definition of the support set, thereby failing to avoid out-of-distribution actions effectively. This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint. SPOT adopts a VAE-based density estimator to explicitly model the support set of the behavior policy and presents a simple but effective density-based regularization term, which can be plugged non-intrusively into off-the-shelf off-policy RL algorithms. On standard benchmarks for offline RL, SPOT substantially outperforms state-of-the-art offline RL methods. Benefiting from the pluggable design, models pretrained offline with SPOT can also be fine-tuned online seamlessly.
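To make the pluggable design concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a conditional VAE provides an ELBO-based estimate of the behavior log-density, which is then added as a penalty to a generic off-policy actor objective. The class and function names (`VAE`, `actor_loss`), the weight `lam`, and the `actor`/`critic` interfaces are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Conditional VAE used as a behavior-density estimator:
    its ELBO lower-bounds log pi_beta(a | s)."""
    def __init__(self, state_dim, action_dim, latent_dim=32, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def elbo(self, state, action):
        """Per-sample lower bound on log pi_beta(a | s)."""
        mu, log_std = self.encoder(torch.cat([state, action], -1)).chunk(2, -1)
        std = log_std.exp()
        z = mu + std * torch.randn_like(std)              # reparameterization trick
        recon = self.decoder(torch.cat([state, z], -1))
        recon_ll = -F.mse_loss(recon, action, reduction="none").sum(-1)
        kl = 0.5 * (mu.pow(2) + std.pow(2) - 2 * log_std - 1).sum(-1)
        return recon_ll - kl


def actor_loss(critic, actor, vae, state, lam=0.1):
    """Generic off-policy actor objective with a density-based support penalty:
    maximize Q(s, pi(s)) + lam * ELBO(s, pi(s)).
    Assumes actor(state) -> action and critic(state, action) -> Q-value."""
    action = actor(state)
    q = critic(state, action)
    log_density = vae.elbo(state, action)                 # estimated behavior log-density
    return -(q + lam * log_density).mean()
```

Because the penalty is just an extra term in the actor loss, the underlying off-policy algorithm (e.g., a TD3-style actor-critic) is left untouched, which is what allows the same networks to be fine-tuned online after offline pretraining.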