Because they rely on large numbers of experiments to learn good actions, current Reinforcement Learning (RL) algorithms have limited applicability in real-world settings, where exploration can be too expensive. We propose an algorithm for batch RL, in which effective policies are learned using only a fixed offline dataset instead of online interactions with the environment. The limited data in batch RL produces inherent uncertainty in the value estimates of states and actions that are insufficiently represented in the dataset. This leads to particularly severe extrapolation error when candidate policies diverge from the one that generated the data. We propose to mitigate this issue via two straightforward penalties: a policy constraint that reduces this divergence and a value constraint that discourages overly optimistic estimates. Over a comprehensive set of 32 continuous-action batch RL benchmarks, our approach compares favorably to state-of-the-art methods, regardless of how the offline data were collected.
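To make the two penalties concrete, one possible instantiation (a sketch only; the divergence measure $D$, the estimated behavior policy $\hat{\pi}_b$, the uncertainty term $u$, and the coefficients $\alpha$ and $\lambda$ are illustrative assumptions, not the exact forms used in this work) is a regularized actor-critic objective over the fixed dataset $\mathcal{D}$:
\[
\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[ Q_\theta(s, a) \big] \;-\; \alpha \, \mathbb{E}_{s \sim \mathcal{D}}\big[ D\big(\pi(\cdot \mid s) \,\|\, \hat{\pi}_b(\cdot \mid s)\big) \big],
\]
with the critic regressed toward a pessimistic target
\[
y(s, a, r, s') \;=\; r \;+\; \gamma \Big( \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\big[ Q_{\bar{\theta}}(s', a') \big] \;-\; \lambda \, u(s', a') \Big),
\]
where the first penalty keeps $\pi$ close to the data-generating policy and the second subtracts an uncertainty estimate $u$ (e.g., ensemble disagreement), so that poorly supported state-action values are not overestimated.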