A promising characteristic of Deep Reinforcement Learning (DRL) is its capability to learn an optimal policy in an end-to-end manner without relying on feature engineering. However, most approaches assume a fully observable state space, i.e., fully observable Markov Decision Processes (MDPs). In real-world robotics, this assumption is impractical because of issues such as limited sensor sensitivity, sensor noise, and uncertainty about whether the observation design is complete. These scenarios lead to Partially Observable MDPs (POMDPs). In this paper, we propose Long Short-Term Memory-based Twin Delayed Deep Deterministic Policy Gradient (LSTM-TD3), which introduces a memory component into TD3, and compare its performance with that of other DRL algorithms in both MDPs and POMDPs. Our results demonstrate the significant advantages of the memory component in addressing POMDPs, including the ability to handle missing and noisy observation data.
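To make the idea of the memory component concrete, the following is a minimal PyTorch-style sketch of an actor that encodes a window of recent observations with an LSTM before a standard TD3 deterministic policy head. The class name, layer sizes, and interface are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Illustrative actor with an LSTM memory over past observations (assumed design)."""

    def __init__(self, obs_dim, act_dim, act_limit, hidden_size=128):
        super().__init__()
        # Memory component: summarizes the history of (possibly noisy or
        # incomplete) observations into a hidden state.
        self.lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)
        # Deterministic policy head, as in standard TD3.
        self.pi = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, act_dim), nn.Tanh(),
        )
        self.act_limit = act_limit

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, seq_len, obs_dim) -- a window of past observations.
        out, hidden = self.lstm(obs_seq, hidden)
        # Act on the most recent memory state.
        action = self.act_limit * self.pi(out[:, -1, :])
        return action, hidden

# Usage: feed a short observation history instead of a single observation.
actor = LSTMActor(obs_dim=8, act_dim=2, act_limit=1.0)
obs_history = torch.randn(1, 5, 8)  # batch of 1, last 5 observations
action, h = actor(obs_history)
```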