Offline reinforcement learning (RL) enables training competent agents from offline datasets without any interaction with the environment. Online finetuning of such offline models can further improve performance. But how should we ideally finetune agents obtained from offline RL training? While offline RL algorithms can in principle be used for finetuning, in practice their online performance improves slowly. In contrast, we show that it is possible to use standard online off-policy algorithms for faster improvement. However, we find that this approach may suffer from policy collapse, where the policy undergoes severe performance deterioration during initial online learning. We investigate the issue of policy collapse and how it relates to data diversity, algorithm choices, and the online replay distribution. Based on these insights, we propose a conservative policy optimization procedure that achieves stable and sample-efficient online learning from offline pretraining.