Animals are able to rapidly infer from limited experience when sets of state-action pairs have equivalent reward and transition dynamics. On the other hand, modern reinforcement learning systems must painstakingly learn through trial and error that sets of state-action pairs are value equivalent -- requiring an often prohibitively large number of samples from their environment. MDP homomorphisms have been proposed that reduce the observed MDP of an environment to an abstract MDP, which can enable more sample-efficient policy learning. Consequently, impressive improvements in sample efficiency have been achieved when a suitable MDP homomorphism can be constructed a priori -- usually by exploiting a practitioner's knowledge of environment symmetries. We propose a novel approach to constructing a homomorphism in discrete action spaces, which uses a partial model of environment dynamics to infer which state-action pairs lead to the same state -- reducing the size of the state-action space by a factor equal to the cardinality of the action space. We call this method equivalent effect abstraction. In a gridworld setting, we demonstrate empirically that equivalent effect abstraction can improve sample efficiency in a model-free setting and planning efficiency for model-based approaches. Furthermore, we show on cartpole that our approach outperforms an existing method for learning homomorphisms, while using 33x less training data.
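As a rough illustration of the idea described above (a minimal sketch, not the paper's implementation), the snippet below collapses a tabular Q-function in a deterministic gridworld: a hand-coded partial forward model predicts the state each action leads to, and every state-action pair that leads to the same state shares a single abstract value, shrinking the table by roughly the cardinality of the action space. The gridworld size, the forward model, and the update rule are illustrative assumptions, and the collapse is only valid under the assumption of deterministic transitions with rewards that depend on the state reached.

```python
import numpy as np

# Illustrative 5x5 deterministic gridworld; actions: up, down, left, right.
SIZE = 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def forward_model(state, action):
    """Partial model of the dynamics: predicts the next state (stays in bounds)."""
    r, c = state
    dr, dc = ACTIONS[action]
    return (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))

# Abstract table indexed only by the *predicted next state*: all state-action
# pairs leading to the same state share one entry, so the table has |S| entries
# instead of |S| x |A|.
q_abstract = np.zeros((SIZE, SIZE))

def q(state, action):
    """Q-value of (state, action), read from the shared abstract entry."""
    return q_abstract[forward_model(state, action)]

def update(state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One-step Q-learning update applied to the shared abstract entry."""
    ns = forward_model(state, action)
    best_next = max(q(next_state, a) for a in range(len(ACTIONS)))
    q_abstract[ns] += alpha * (reward + gamma * best_next - q_abstract[ns])
```

Because the update writes to one shared entry, experience gathered with any action that reaches a given state immediately informs the values of all other state-action pairs that reach it, which is the source of the claimed sample-efficiency gain in this simplified setting.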