Reinforcement Learning (RL) requires a large amount of exploration, especially in sparse-reward settings. Imitation Learning (IL) can learn from expert demonstrations without exploration, but it cannot exceed the expert's performance and is vulnerable to the distributional shift between demonstration and execution. In this paper, we unify RL and IL on the basis of the Free Energy Principle (FEP). FEP is a unified Bayesian theory of the brain that explains perception, action, and model learning through a single fundamental principle. We present a theoretical extension of FEP and derive an algorithm in which an agent learns a world model that internalizes expert demonstrations while simultaneously using that model to infer current and future states and the actions that maximize reward. The algorithm thus reduces exploration costs by partially imitating the expert while seamlessly maximizing its return, allowing it to surpass the performance of a suboptimal expert. Our experimental results show that this approach is promising for visual control tasks, especially in sparse-reward environments.
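For reference, a minimal sketch of the quantity that FEP-based agents minimize, written in standard active-inference notation (the paper's exact formulation and notation may differ): for observations $o$, latent states $s$, a generative model $p(o, s)$, and an approximate posterior $q(s)$, the variational free energy is
$$
\mathcal{F}(q) \;=\; \mathbb{E}_{q(s)}\!\left[\ln q(s) - \ln p(o, s)\right]
\;=\; D_{\mathrm{KL}}\!\left[q(s)\,\|\,p(s \mid o)\right] \;-\; \ln p(o),
$$
so minimizing $\mathcal{F}$ with respect to $q(s)$ fits beliefs to observations (perception) and tightens a bound on model evidence (model learning); extending the same objective to expected free energy over candidate future trajectories yields action selection.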