In partially observable reinforcement learning, offline training gives access to latent information that is not available during online training and/or execution, such as the system state. Asymmetric actor-critic methods exploit such information by training a history-based policy via a state-based critic. However, many asymmetric methods lack a theoretical foundation and are evaluated only on limited domains. We examine the theory of asymmetric actor-critic methods that use state-based critics, and expose fundamental issues that undermine the validity of a common variant and its ability to address high partial observability. We propose an unbiased asymmetric actor-critic variant that exploits state information while remaining theoretically sound: it maintains the validity of the policy gradient theorem, and introduces no bias and relatively low variance into the training process. An empirical evaluation on domains that exhibit significant partial observability confirms our analysis, and shows that the unbiased asymmetric actor-critic converges to better policies, and/or converges faster, than symmetric actor-critic and standard asymmetric actor-critic baselines.
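To make the asymmetric setup concrete, the following is a minimal sketch (not the paper's implementation) of a generic asymmetric advantage actor-critic update in a PyTorch-style discrete-action setting: the critic conditions on the latent state, which is assumed available only during offline training, while the policy conditions on the observation history. It illustrates the general state-based-critic construction described above, not the unbiased variant proposed in the paper; all network shapes, names, and hyperparameters are illustrative assumptions.

```python
# Sketch of an asymmetric actor-critic update:
# critic V(s) uses the latent state, policy pi(a|h) uses the history.
import torch
import torch.nn as nn

STATE_DIM, HIST_DIM, N_ACTIONS = 8, 16, 4  # illustrative sizes

critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(HIST_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
optim = torch.optim.Adam(list(critic.parameters()) + list(actor.parameters()), lr=3e-4)

def asymmetric_a2c_update(state, next_state, history, action, reward, done, gamma=0.99):
    """One update with a state-based critic and a history-based actor."""
    v = critic(state).squeeze(-1)                   # V(s_t), uses latent state
    with torch.no_grad():
        v_next = critic(next_state).squeeze(-1)     # V(s_{t+1})
        td_target = reward + gamma * (1.0 - done) * v_next
    advantage = (td_target - v).detach()            # state-based advantage estimate
    logits = actor(history)                         # pi(. | h_t), uses history only
    log_prob = torch.distributions.Categorical(logits=logits).log_prob(action)
    actor_loss = -(advantage * log_prob).mean()     # policy-gradient term
    critic_loss = (td_target - v).pow(2).mean()     # TD regression for V(s)
    optim.zero_grad()
    (actor_loss + critic_loss).backward()
    optim.step()
    return actor_loss.item(), critic_loss.item()
```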