MuZero, a model-based reinforcement learning algorithm that uses a value-equivalent dynamics model, achieved state-of-the-art performance in Chess, Shogi and the game of Go. In contrast to standard forward dynamics models that predict a full next state, value-equivalent models are trained to predict a future value, thereby emphasizing value-relevant information in their representations. While value-equivalent models have shown strong empirical success, no prior work has visualized or investigated what kinds of representations these models actually learn. Therefore, in this paper we visualize the latent representations of MuZero agents. We find that action trajectories may diverge between observation embeddings and internal-state transition dynamics, which could lead to instability during planning. Based on this insight, we propose two regularization techniques to stabilize MuZero's performance. Additionally, we provide an open-source implementation of MuZero along with an interactive visualizer of learned representations, which may aid further investigation of value-equivalent algorithms.
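For concreteness, a value-equivalent objective in the spirit of MuZero can be sketched as follows; the notation here (h_\theta for the observation embedding, g_\theta for the learned dynamics, K for the unroll depth, and z, u, \pi for the value, reward and search-policy targets) is illustrative rather than taken from the paper:

\[
s_t^0 = h_\theta(o_t), \qquad s_t^k = g_\theta\!\left(s_t^{k-1}, a_{t+k}\right),
\]
\[
\mathcal{L}_t(\theta) = \sum_{k=0}^{K}
      \ell^{v}\!\left(z_{t+k},\, v_\theta(s_t^k)\right)
    + \ell^{r}\!\left(u_{t+k},\, r_\theta(s_t^k)\right)
    + \ell^{p}\!\left(\pi_{t+k},\, p_\theta(s_t^k)\right).
\]

The key property is that no term asks the unrolled latent state s_t^k to reconstruct the future observation o_{t+k}; the representations are shaped only by value, reward and policy prediction, which is why embedded observations and internally predicted states need not coincide.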