The problem of offline reinforcement learning focuses on learning a good policy from a log of environment interactions. Past efforts to develop algorithms in this area have revolved around adding constraints to online reinforcement learning algorithms so that the actions of the learned policy stay close to the logged data. In this work, we explore an alternative approach: planning on the fixed dataset directly. Specifically, we introduce an algorithm that forms a tabular Markov Decision Process (MDP) over the logged data by adding new transitions to the dataset. We do this by using learned dynamics models to plan short trajectories between states. Since exact value iteration can be performed on this constructed MDP, it becomes easy to identify which trajectories are advantageous to add to the MDP. Crucially, since most transitions in this MDP come from the logged data, trajectories from the MDP can be rolled out for long periods with confidence. We prove that this property allows one to derive upper and lower bounds on the value function, up to appropriate distance metrics. Finally, we demonstrate empirically how algorithms that uniformly constrain the learned policy to the entire dataset can result in unwanted behavior, and we show an example in which simply behavior cloning the optimal policy of the MDP created by our algorithm avoids this problem.
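To make the construction concrete, the following is a minimal sketch (not the authors' implementation) of the core idea: treat each logged state as a node in a tabular MDP whose edges are the logged transitions, add one hypothetical "stitched" transition of the kind a learned dynamics model might propose, and run exact value iteration before and after to see how the added edge changes the optimal values. The deterministic edge representation, the toy rewards, and the helper name value_iteration are illustrative assumptions, not the paper's code.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Exact value iteration on a small deterministic tabular MDP.

    P[s] maps an action id to the next state; R[s] maps the same action id
    to its reward. Every edge corresponds to one logged (or stitched) transition.
    """
    V = np.zeros(len(P))
    while True:
        V_new = np.array([
            max((R[s][a] + gamma * V[s_next] for a, s_next in P[s].items()),
                default=0.0)  # states with no outgoing edges are treated as terminal
            for s in range(len(P))
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Tabular MDP built from a single logged trajectory 0 -> 1 -> 2 -> 3,
# with reward only on the final logged step.
P = [{0: 1}, {0: 2}, {0: 3}, {}]
R = [{0: 0.0}, {0: 0.0}, {0: 1.0}, {}]
print("values from logged data only:", value_iteration(P, R))

# Hypothetical stitched transition: a short planned jump from state 0 straight
# to state 2, of the kind a learned dynamics model might propose. Because the
# rest of the MDP consists of logged data, re-running exact value iteration
# immediately shows whether the new edge is advantageous.
P[0][1] = 2
R[0][1] = 0.0
print("values after adding the stitch:", value_iteration(P, R))
```

In the full algorithm, candidate stitches would be short model-planned trajectories between dataset states and would only be kept when the resulting values improve; the sketch above illustrates only the evaluation step that exact value iteration makes cheap.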