Many of the challenges facing today's reinforcement learning (RL) algorithms, such as robustness, generalization, transfer, and computational efficiency, are closely related to compression. Prior work has convincingly argued why minimizing information is useful in the supervised learning setting, but standard RL algorithms lack an explicit mechanism for compression. The RL setting is unique because (1) its sequential nature allows an agent to use past information to avoid looking at future observations and (2) the agent can optimize its behavior to prefer states where decision making requires few bits. We take advantage of these properties to propose a method (RPC) for learning simple policies. This method brings together ideas from information bottlenecks, model-based RL, and bits-back coding into a simple and theoretically justified algorithm. Our method jointly optimizes a latent-space model and policy to be self-consistent, such that the policy avoids states where the model is inaccurate. We demonstrate that our method achieves much tighter compression than prior methods, obtaining up to 5x higher reward than a standard information bottleneck. We also demonstrate that our method learns policies that are more robust and generalize better to new tasks.