A key component of model-based reinforcement learning (RL) is a dynamics model that predicts the outcomes of actions. Errors in this predictive model can degrade the performance of model-based controllers, and complex Markov decision processes (MDPs) can present exceptionally difficult prediction problems. To mitigate this issue, we propose predictable MDP abstraction (PMA): instead of training a predictive model on the original MDP, we train a model on a transformed MDP with a learned action space that only permits predictable, easy-to-model actions, while covering the original state-action space as much as possible. As a result, model learning becomes easier and more accurate, which allows robust, stable model-based planning or model-based RL. This transformation is learned in an unsupervised manner, before any task is specified by the user. Downstream tasks can then be solved with model-based control in a zero-shot fashion, without additional environment interactions. We theoretically analyze PMA and empirically demonstrate that PMA leads to significant improvements over prior unsupervised model-based RL approaches in a range of benchmark environments. Our code and videos are available at https://seohong.me/projects/pma/
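To make the high-level idea concrete, below is a minimal, illustrative sketch (not the authors' implementation) of how a predictable latent action space might be used. All names and shapes here are hypothetical assumptions: a latent action z is decoded by a policy pi(a | s, z) into a low-level action, a latent dynamics model f(s, z) predicts the next state, training scores latent actions by how predictable their outcomes are, and a simple random-shooting MPC plans over latent actions zero-shot at test time. The full PMA objective also includes a coverage term, which this sketch omits.

```python
# Hypothetical sketch of the PMA idea; names, shapes, and the objective are assumptions.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, LATENT_DIM = 8, 2, 4

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

policy = mlp(STATE_DIM + LATENT_DIM, ACTION_DIM)    # pi(a | s, z): decodes latent actions
dynamics = mlp(STATE_DIM + LATENT_DIM, STATE_DIM)   # f(s, z): dynamics over latent actions

def predictability_reward(s, z, s_next):
    """Negative model error: latent actions whose outcomes are easy to predict score high.
    (The coverage/diversity term of the full objective is omitted in this sketch.)"""
    pred = dynamics(torch.cat([s, z], dim=-1))
    return -((pred - s_next) ** 2).sum(dim=-1)

def plan(s0, reward_fn, horizon=5, n_candidates=256):
    """Zero-shot random-shooting MPC in the learned latent action space."""
    zs = torch.randn(n_candidates, horizon, LATENT_DIM)
    s = s0.expand(n_candidates, -1)
    total = torch.zeros(n_candidates)
    for t in range(horizon):
        s = dynamics(torch.cat([s, zs[:, t]], dim=-1))
        total = total + reward_fn(s)  # task reward is only needed at planning time
    best = total.argmax()
    return zs[best, 0]  # execute the first latent action of the best candidate sequence
```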