When the environment is partially observable (PO), a deep reinforcement learning (RL) agent must learn a suitable temporal representation of the entire history in addition to a control strategy. This problem is not new, and both model-free and model-based algorithms have been proposed to address it. However, inspired by recent success in model-free image-based RL, we noticed the absence of a model-free baseline for history-based RL that (1) uses the full history and (2) incorporates recent advances in off-policy continuous control. Therefore, in this work we implement recurrent versions of DDPG, TD3, and SAC (RDPG, RTD3, and RSAC), evaluate them on short-term and long-term PO domains, and investigate key design choices. Our experiments show that RDPG and RTD3 can surprisingly fail on some domains and that RSAC is the most reliable, reaching near-optimal performance on nearly all domains. However, one task that requires systematic exploration remained difficult, even for RSAC. These results show that model-free RL can learn good temporal representations using only reward signals; the primary difficulties appear to be computational cost and exploration. To facilitate future research, we have made our PyTorch implementation publicly available at https://github.com/zhihanyang2022/off-policy-continuous-control.
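To make the "full history" idea concrete, below is a minimal PyTorch sketch of a history-conditioned recurrent actor, in the spirit of the recurrent agents described above but not taken from the linked repository. The class name `RecurrentActor`, the LSTM summarizer, and the hidden sizes are illustrative assumptions; the point is only that the policy conditions on the entire observation history rather than a single observation.

```python
# Minimal sketch (assumptions, not the authors' exact code): an LSTM summarizes the
# observation history o_1..o_t, and a tanh head maps each summary to a continuous action.
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.summarizer = nn.LSTM(obs_dim, hidden_dim, batch_first=True)  # temporal representation
        self.policy_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, act_dim), nn.Tanh(),  # actions bounded in [-1, 1]
        )

    def forward(self, obs_history: torch.Tensor) -> torch.Tensor:
        # obs_history: (batch, time, obs_dim). The hidden state at step t summarizes the
        # history up to t, which stands in for the unobserved state under partial observability.
        summary, _ = self.summarizer(obs_history)
        return self.policy_head(summary)  # one action per time step

# Usage example with arbitrary dimensions: 8 histories of length 50, 17-dim observations, 6-dim actions.
actor = RecurrentActor(obs_dim=17, act_dim=6)
actions = actor(torch.randn(8, 50, 17))
print(actions.shape)  # torch.Size([8, 50, 6])
```

A recurrent critic can be built the same way by concatenating actions to the summarizer input; training then proceeds with the usual DDPG/TD3/SAC losses applied over whole trajectories.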