Model-based planning holds great promise for improving both sample efficiency and generalization in reinforcement learning (RL). We show that energy-based models (EBMs) are a promising class of models for model-based planning. EBMs naturally support inference of intermediate states given start and goal state distributions. We provide an online algorithm to train EBMs while interacting with the environment, and show that EBMs allow for significantly better online learning than corresponding feed-forward networks. We further show that EBMs support maximum entropy state inference and are able to generate diverse state space plans. We show that inference purely in state space, without planning actions, allows for better generalization to previously unseen obstacles in the environment and prevents the planner from exploiting the dynamics model by applying uncharacteristic action sequences. Finally, we show that online EBM training naturally leads to intentionally planned state exploration, which performs significantly better than random exploration.
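As a rough illustration of the state inference described above, the sketch below infers a single intermediate state between a start and a goal state by noisy gradient descent (Langevin-style sampling) on a summed pairwise transition energy. This is a minimal sketch, not the paper's method: the `TransitionEnergy` architecture, the `infer_midpoint` helper, and the step size and noise scale are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical pairwise energy model: E(s_t, s_{t+1}) scores how
# plausible a one-step transition between two states is (lower = better).
class TransitionEnergy(nn.Module):
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1)).squeeze(-1)

def infer_midpoint(ebm, s_start, s_goal, steps=50, step_size=0.1, noise=0.005):
    """Infer an intermediate state by descending the summed transition
    energy E(start, mid) + E(mid, goal), starting from random noise.
    The injected Gaussian noise keeps inferred plans diverse, in the
    spirit of the maximum entropy inference the abstract mentions."""
    s_mid = torch.randn_like(s_start, requires_grad=True)
    for _ in range(steps):
        energy = (ebm(s_start, s_mid) + ebm(s_mid, s_goal)).sum()
        grad, = torch.autograd.grad(energy, s_mid)
        with torch.no_grad():
            s_mid -= step_size * grad
            s_mid += noise * torch.randn_like(s_mid)
    return s_mid.detach()

# Usage: plan one intermediate state between a start and a goal.
state_dim = 4
ebm = TransitionEnergy(state_dim)
s0, sg = torch.zeros(1, state_dim), torch.ones(1, state_dim)
print(infer_midpoint(ebm, s0, sg))
```

Note that inference operates purely on states; no actions appear anywhere in the optimization, which is what lets this style of planner generalize across changed dynamics such as new obstacles.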