When the environment is partially observable (PO), a deep reinforcement learning (RL) agent must learn a suitable temporal representation of the entire history in addition to a control strategy. This problem is not new, and both model-free and model-based algorithms have been proposed to address it. However, inspired by recent success in model-free image-based RL, we noticed the absence of a model-free baseline for history-based RL that (1) uses the full history and (2) incorporates recent advances in off-policy continuous control. Therefore, in this work we implement recurrent versions of DDPG, TD3, and SAC (RDPG, RTD3, and RSAC), evaluate them on short-term and long-term PO domains, and investigate key design choices. Our experiments show that RDPG and RTD3 can surprisingly fail on some domains and that RSAC is the most reliable, reaching near-optimal performance on nearly all domains. However, one task that requires systematic exploration remained difficult, even for RSAC. These results show that model-free RL can learn good temporal representations using only reward signals; the primary difficulties appear to be computational cost and exploration. To facilitate future research, we have made our PyTorch implementation publicly available at https://github.com/zhihanyang2022/off-policy-continuous-control.
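To make the "full history" idea concrete, below is a minimal PyTorch sketch of a history-conditioned recurrent actor, in the spirit of the recurrent agents described above but not taken from the linked repository. The class name `RecurrentActor`, the LSTM summarizer, and the hidden sizes are illustrative assumptions; the point is only that the policy conditions on the entire observation history rather than a single observation.

```python
# Minimal sketch (assumptions, not the authors' exact code): an LSTM summarizes the
# observation history o_1..o_t, and a tanh head maps each summary to a continuous action.
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.summarizer = nn.LSTM(obs_dim, hidden_dim, batch_first=True)  # temporal representation
        self.policy_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, act_dim), nn.Tanh(),  # actions bounded in [-1, 1]
        )

    def forward(self, obs_history: torch.Tensor) -> torch.Tensor:
        # obs_history: (batch, time, obs_dim). The hidden state at step t summarizes the
        # history up to t, which stands in for the unobserved state under partial observability.
        summary, _ = self.summarizer(obs_history)
        return self.policy_head(summary)  # one action per time step

# Usage example with arbitrary dimensions: 8 histories of length 50, 17-dim observations, 6-dim actions.
actor = RecurrentActor(obs_dim=17, act_dim=6)
actions = actor(torch.randn(8, 50, 17))
print(actions.shape)  # torch.Size([8, 50, 6])
```

A recurrent critic can be built the same way by concatenating actions to the summarizer input; training then proceeds with the usual DDPG/TD3/SAC losses applied over whole trajectories.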