Offline reinforcement learning (RL) aims to learn a near-optimal policy from recorded offline experience, without online exploration. Current offline RL research addresses two components: 1) generative modeling, i.e., approximating a policy from fixed data; and 2) learning the state-action value function. While most work focuses on the value-function side, reducing the bootstrapping error in value-function approximation induced by the distribution shift of the training data, the error propagated from generative modeling has been largely neglected. In this paper, we analyze the error in generative modeling and propose AQL (action-conditioned Q-learning), a residual generative model that reduces the policy approximation error in offline RL. We show that our method learns more accurate policy approximations on different benchmark datasets. In addition, we show that the proposed offline RL method learns more competitive AI agents for the complex control tasks of the multiplayer online battle arena (MOBA) game Honor of Kings.
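A minimal sketch of the general idea described above, not the authors' AQL implementation: a base generative model imitates the behavior policy from fixed data, a residual head learns a small correction to reduce the policy approximation error, and an action-conditioned value function Q(s, a) scores the resulting actions. All module names, scales, and dimensions here are illustrative assumptions.

```python
# Hypothetical sketch of a residual generative policy with an action-conditioned
# Q-function for offline RL. Not the paper's AQL; shapes and names are assumptions.
import torch
import torch.nn as nn


class ResidualGenerativePolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        # Base generative model: approximates the behavior policy from offline data.
        self.base = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim)
        )
        # Residual head: learns a bounded correction to the base model's action.
        self.residual = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )
        self.residual_scale = 0.1  # keep corrected actions close to the data distribution

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        base_action = self.base(state)
        correction = self.residual(torch.cat([state, base_action], dim=-1))
        return base_action + self.residual_scale * correction


class QNetwork(nn.Module):
    """Action-conditioned state-action value function Q(s, a)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))


if __name__ == "__main__":
    state_dim, action_dim = 8, 2
    policy = ResidualGenerativePolicy(state_dim, action_dim)
    q_fn = QNetwork(state_dim, action_dim)
    state = torch.randn(4, state_dim)   # a batch of offline states
    action = policy(state)              # base generative action plus residual correction
    value = q_fn(state, action)         # value used to evaluate the corrected action
    print(action.shape, value.shape)    # torch.Size([4, 2]) torch.Size([4, 1])
```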