A promising characteristic of Deep Reinforcement Learning (DRL) is its capability to learn an optimal policy in an end-to-end manner without relying on feature engineering. However, most approaches assume a fully observable state space, i.e., a fully observable Markov Decision Process (MDP). In real-world robotics, this assumption is impractical because of sensor issues, such as limited sensor capacity and sensor noise, and the lack of knowledge about whether the observation design is complete. Such scenarios lead to a Partially Observable MDP (POMDP) and require special treatment. In this paper, we propose Long-Short-Term-Memory-based Twin Delayed Deep Deterministic Policy Gradient (LSTM-TD3), which introduces a memory component into TD3, and compare its performance with other DRL algorithms in both MDPs and POMDPs. Our results demonstrate the significant advantages of the memory component in addressing POMDPs, including the ability to handle missing and noisy observation data.
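To illustrate the idea of adding a memory component to TD3, the following is a minimal sketch, assuming a PyTorch-style actor that encodes the past observation-action history with an LSTM and combines it with the current observation; the class and parameter names (LSTMActor, hidden_size, etc.) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: TD3 actor augmented with an LSTM memory over
# the past observation-action history (names are illustrative).
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    def __init__(self, obs_dim, act_dim, act_limit, hidden_size=128):
        super().__init__()
        # Memory branch: encode the history of (observation, action) pairs.
        self.memory = nn.LSTM(obs_dim + act_dim, hidden_size, batch_first=True)
        # Current-observation branch.
        self.current = nn.Sequential(nn.Linear(obs_dim, hidden_size), nn.ReLU())
        # Combine the memory summary with current features to output an action.
        self.policy = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, act_dim), nn.Tanh(),
        )
        self.act_limit = act_limit

    def forward(self, obs, hist_obs, hist_act):
        # hist_obs: (batch, T, obs_dim); hist_act: (batch, T, act_dim)
        hist = torch.cat([hist_obs, hist_act], dim=-1)
        _, (h_n, _) = self.memory(hist)   # final hidden state summarizes the history
        mem = h_n[-1]                     # (batch, hidden_size)
        cur = self.current(obs)           # (batch, hidden_size)
        return self.act_limit * self.policy(torch.cat([mem, cur], dim=-1))
```

Under partial observability, the current observation alone is ambiguous; the LSTM summary of recent history supplies the missing context (e.g., bridging dropped or noisy sensor readings), which is the role the memory component plays in LSTM-TD3.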