Deep model-based Reinforcement Learning (RL) has the potential to substantially improve the sample-efficiency of deep RL. While various challenges have long held it back, a number of papers have recently come out reporting success with deep model-based methods. This is a great development, but the lack of a consistent metric to evaluate such methods makes it difficult to compare various approaches. For example, the common single-task sample-efficiency metric conflates improvements due to model-based learning with other aspects, such as representation learning, making it difficult to assess true progress on model-based RL. To address this, we introduce an experimental setup to evaluate the model-based behavior of RL methods, inspired by work from neuroscience on detecting model-based behavior in humans and animals. Our metric based on this setup, the Local Change Adaptation (LoCA) regret, measures how quickly an RL method adapts to a local change in the environment. Our metric can identify model-based behavior even if the method uses a poor representation, and it provides insight into how close a method's behavior is to optimal model-based behavior. We use our setup to evaluate the model-based behavior of MuZero on a variation of the classic Mountain Car task.
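To make the evaluation protocol concrete, the following Python sketch illustrates one way a LoCA-style adaptation regret could be measured: an agent is first trained under the original reward, the environment is then changed locally, and the shortfall of the agent's return relative to the optimal post-change return is accumulated during adaptation. All names here (`Agent`, `env`, `optimal_return`, `apply_local_reward_change`, the episode counts) are hypothetical placeholders for illustration only, not the paper's actual setup or implementation.

```python
# Hedged sketch of a LoCA-style adaptation-regret measurement.
# The agent/environment interfaces are assumed, not taken from the paper;
# the paper's exact task phases and regret definition may differ.

def run_episode(agent, env, learn=True):
    """Run one episode and return the undiscounted episode return."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        action = agent.act(obs)
        next_obs, reward, done = env.step(action)
        if learn:
            agent.update(obs, action, reward, next_obs, done)
        obs, total = next_obs, total + reward
    return total


def loca_regret_sketch(agent, env, optimal_return,
                       pretrain_episodes=500, adapt_episodes=100):
    # Phase 1: train to (near-)convergence under the original reward.
    for _ in range(pretrain_episodes):
        run_episode(agent, env)

    # Phase 2: apply a *local* change to the environment (hypothetical API)
    # and accumulate the per-episode shortfall relative to the optimal
    # return achievable in the changed environment.
    env.apply_local_reward_change()
    regret = 0.0
    for _ in range(adapt_episodes):
        regret += max(0.0, optimal_return - run_episode(agent, env))
    return regret
```

Under this sketch, a method exhibiting model-based behavior would update its plan quickly after the local change and accumulate little regret, whereas a purely model-free method would need many episodes of re-learning and accumulate much more.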