Recent developments in model-based RL have proven successful in a range of environments, especially ones where planning is essential. However, these successes have been limited to deterministic, fully observed environments. We present a new approach that handles stochastic and partially observable environments. Our key insight is to use discrete autoencoders to capture the multiple possible effects of an action in a stochastic environment. We use a stochastic variant of \emph{Monte Carlo tree search} to plan over both the agent's actions and the discrete latent variables representing the environment's response. Our approach significantly outperforms an offline version of MuZero on a stochastic interpretation of chess where the opponent is considered part of the environment. We also show that our approach scales to \emph{DeepMind Lab}, a first-person 3D environment with large visual observations and partial observability.
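To illustrate the planning idea in the abstract, the following is a minimal, hypothetical sketch (not the paper's implementation): the agent's decision is evaluated by sampling from a discrete set of chance outcomes, which stand in for the discrete latent codes that the paper's autoencoder would produce to represent the environment's stochastic response. The names `evaluate_action` and `plan` are illustrative, not from the paper.

```python
import random

def evaluate_action(action, chance_outcomes, value_fn, num_samples=32, rng=None):
    """Estimate an action's value by sampling discrete chance outcomes.

    chance_outcomes: list of (latent_code, probability) pairs standing in
    for the discrete latent variables of the environment's response.
    value_fn(action, latent_code): value of taking `action` when the
    environment responds with `latent_code`.
    """
    rng = rng or random.Random(0)
    codes, probs = zip(*chance_outcomes)
    total = 0.0
    for _ in range(num_samples):
        # Sample one discrete latent code for the environment's response,
        # as a chance node in the search tree would.
        code = rng.choices(codes, weights=probs)[0]
        total += value_fn(action, code)
    return total / num_samples

def plan(actions, chance_outcomes, value_fn):
    """One-step lookahead: pick the action with the best sampled value."""
    return max(actions, key=lambda a: evaluate_action(a, chance_outcomes, value_fn))
```

The full method interleaves such decision and chance nodes to arbitrary depth inside Monte Carlo tree search; this sketch shows only a single decision-then-chance step.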