Offline reinforcement learning (RL) aims to infer sequential decision policies using only offline datasets. This is a particularly difficult setting, especially when learning to achieve multiple different goals or outcomes under a given scenario with only sparse rewards. For offline learning of goal-conditioned policies via supervised learning, previous work has shown that an advantage-weighted log-likelihood loss guarantees monotonic policy improvement. In this work we argue that, despite its benefits, this approach is still insufficient to fully address the distribution shift and multi-modality problems. The latter is particularly severe in long-horizon tasks, where finding a unique and optimal policy that leads from a state to the desired goal is challenging because there may be multiple, potentially conflicting solutions. To tackle these challenges, we propose a complementary advantage-based weighting scheme that introduces an additional source of inductive bias: given a value-based partitioning of the state space, the contribution of actions expected to lead to target regions that are easier to reach than the final goal is further increased. Empirically, we demonstrate that the proposed approach, Dual-Advantage Weighted Offline Goal-conditioned RL (DAWOG), outperforms several competing offline algorithms on commonly used benchmarks. Analytically, we offer a guarantee that the learnt policy is never worse than the underlying behaviour policy.
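To make the weighting scheme described above concrete, the following is a minimal schematic sketch of a dual-advantage weighted objective, written only as an illustration of the idea: the temperatures $\beta_1, \beta_2$ and the partition-based advantage $\tilde{A}$ are placeholder symbols introduced here, and the precise weighting and definitions are those given in the paper.
\[
\mathcal{L}(\theta) \;=\; \mathbb{E}_{(s,a,g)\sim\mathcal{D}}\Big[\exp\!\big(\beta_1\, A(s,a,g)\big)\,\exp\!\big(\beta_2\, \tilde{A}(s,a,g)\big)\,\log \pi_\theta(a \mid s, g)\Big],
\]
where $A(s,a,g)$ denotes the standard goal-conditioned advantage of the behaviour policy (as in advantage-weighted supervised learning), and $\tilde{A}(s,a,g)$ is a second advantage defined over a value-based partition of the state space, so that actions expected to lead to target regions that are easier to reach than the final goal receive a larger weight in the log-likelihood loss.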