Continuous Doubly Constrained Batch Reinforcement Learning

Reliant on too many experiments to learn good actions, current Reinforcement Learning (RL) algorithms have limited applicability in real-world settings, which can be too expensive to allow exploration. We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment. The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data. This leads to particularly severe extrapolation when our candidate policies diverge from the one that generated the data. We propose to mitigate this issue via two straightforward penalties: a policy-constraint to reduce this divergence and a value-constraint that discourages overly optimistic estimates. Over a comprehensive set of 32 continuous-action batch RL benchmarks, our approach compares favorably to state-of-the-art methods, regardless of how the offline data were collected.
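To make the two penalties concrete, here is a minimal sketch of how a policy-constraint and a value-constraint could enter the actor and critic losses. It is an illustration under stated assumptions, not the authors' formulation: the abstract does not give the penalty forms, so the squared action distance, the clipped squared Q-value gap, the weights `lam` and `eta`, and all function names are hypothetical choices made for this example.

```python
from typing import Sequence

Vector = Sequence[float]


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def value_penalty(q_policy: Sequence[Vector], q_data: Vector) -> float:
    """Assumed value-constraint: penalize Q-values of policy-sampled actions
    only where they exceed the Q-value of the logged action for that state."""
    gaps = [
        max(max(samples) - q_logged, 0.0)  # clip: no penalty if not optimistic
        for samples, q_logged in zip(q_policy, q_data)
    ]
    return mean([g * g for g in gaps])


def policy_penalty(policy_actions: Sequence[Vector],
                   data_actions: Sequence[Vector]) -> float:
    """Assumed policy-constraint: mean squared distance between the policy's
    actions and the logged actions (a stand-in for a divergence measure)."""
    return mean([
        sum((p - d) ** 2 for p, d in zip(pa, da))
        for pa, da in zip(policy_actions, data_actions)
    ])


def critic_loss(q_data: Vector, td_targets: Vector,
                q_policy: Sequence[Vector], eta: float) -> float:
    """Squared TD error plus the weighted value penalty (eta is hypothetical)."""
    td_error = mean([(q - y) ** 2 for q, y in zip(q_data, td_targets)])
    return td_error + eta * value_penalty(q_policy, q_data)


def actor_loss(q_of_policy_actions: Vector,
               policy_actions: Sequence[Vector],
               data_actions: Sequence[Vector], lam: float) -> float:
    """Negative mean Q-value plus the weighted policy penalty (lam is hypothetical)."""
    return -mean(q_of_policy_actions) + lam * policy_penalty(policy_actions, data_actions)
```

In this sketch the value penalty is one-sided: it only fires when the critic rates a policy-proposed action above the action actually seen in the dataset, which is where extrapolation error is most likely to be optimistic. The policy penalty pulls the learned policy back toward the logged actions, which limits how far it can drift into poorly covered regions.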
