Many problems in RL, such as meta-RL, robust RL, generalization in RL, and temporal credit assignment, can be cast as POMDPs. In theory, simply augmenting model-free RL with memory-based architectures, such as recurrent neural networks, provides a general approach to solving all types of POMDPs. However, prior work has found that such recurrent model-free RL methods tend to perform worse than more specialized algorithms that are designed for specific types of POMDPs. This paper revisits this claim. We find that careful architecture and hyperparameter decisions can often yield a recurrent model-free implementation that performs on par with (and occasionally substantially better than) more sophisticated recent techniques. We compare on 21 environments drawn from 6 prior specialized methods and find that our implementation achieves greater sample efficiency and asymptotic performance than these methods on 18/21 environments. We also release a simple and efficient implementation of recurrent model-free RL for future work to use as a baseline for POMDPs.
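As a rough illustration of the recipe the abstract describes (this is a minimal sketch, not the released implementation), the core idea is to feed the observation together with the previous action through a recurrent encoder so a model-free actor conditions on the full history rather than the current observation alone. The class name, layer sizes, and the GRU choice below are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): a recurrent actor that
# summarizes the (observation, previous action) history for acting in a POMDP.
import torch
import torch.nn as nn


class RecurrentActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Embed the current observation together with the previous action.
        self.embed = nn.Linear(obs_dim + act_dim, hidden_dim)
        # A GRU compresses the history into a fixed-size memory state.
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Policy head maps the memory state to tanh-squashed action means.
        self.head = nn.Sequential(nn.Linear(hidden_dim, act_dim), nn.Tanh())

    def forward(self, obs, prev_act, hidden=None):
        # obs: (batch, seq_len, obs_dim); prev_act: (batch, seq_len, act_dim)
        x = torch.relu(self.embed(torch.cat([obs, prev_act], dim=-1)))
        summary, hidden = self.rnn(x, hidden)
        return self.head(summary), hidden


# Usage: carry the hidden state forward one step at a time during interaction.
actor = RecurrentActor(obs_dim=4, act_dim=2)
obs = torch.zeros(1, 1, 4)
prev_act = torch.zeros(1, 1, 2)
action, h = actor(obs, prev_act)            # first step; hidden starts at zeros
action, h = actor(obs, prev_act, hidden=h)  # later steps reuse the memory state
```

Any standard model-free algorithm can then train this actor on sequences sampled from a replay buffer or on-policy rollouts; the paper's claim is that getting such architectural and hyperparameter details right is what makes this simple recipe competitive.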