Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains with an infinite number of possible actions. In this work, we introduce a new update rule for online and offline RL that directly models the maximal value using Extreme Value Theory (EVT), drawing inspiration from economics. By doing so, we avoid computing Q-values using out-of-distribution actions, which is often a substantial source of error. Our key insight is to introduce an objective that directly estimates the optimal soft-value functions (LogSumExp) in the maximum entropy RL setting without needing to sample from a policy. Using EVT, we derive our Extreme Q-Learning framework and, consequently, online and (for the first time) offline MaxEnt Q-learning algorithms that do not explicitly require access to a policy or its entropy. Our method obtains consistently strong performance on the D4RL benchmark, outperforming prior work by 10+ points on some tasks, while offering moderate improvements over SAC and TD3 on online DM Control tasks.
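To make the stated insight concrete, the following is a minimal, hypothetical sketch (not the authors' released implementation) of the Gumbel-style regression that the EVT view motivates: minimizing a loss of the form E[exp(z) - z - 1], with z = (q - v)/beta, over a scalar v recovers the soft maximum v* = beta * log E[exp(q/beta)], i.e., a LogSumExp-style value, without sampling actions from any policy. All names here (gumbel_regression_loss, soft_value_closed_form, beta) are illustrative assumptions, not identifiers from the paper's codebase.

```python
# Hypothetical sketch: Gumbel (linex) regression recovers a LogSumExp-style
# soft maximum of sampled "Q-values" without sampling from a policy.
#   L(v) = E[ exp((q - v)/beta) - (q - v)/beta - 1 ]
# is minimized at v* = beta * log E[exp(q / beta)].

import numpy as np

rng = np.random.default_rng(0)


def gumbel_regression_loss(v, q, beta):
    """Linex-style Gumbel regression loss; minimized at the soft maximum of q."""
    z = (q - v) / beta
    return np.mean(np.exp(z) - z - 1.0)


def soft_value_closed_form(q, beta):
    """Closed-form minimizer: beta * log E[exp(q / beta)]."""
    return beta * np.log(np.mean(np.exp(q / beta)))


# "Q-values" sampled from some behavior distribution (synthetic data).
q_samples = rng.normal(loc=1.0, scale=2.0, size=10_000)
beta = 1.0

# Simple gradient descent on v; dL/dv = (1 - E[exp(z)]) / beta.
v = 0.0
lr = 0.5
for _ in range(2000):
    z = (q_samples - v) / beta
    grad = (1.0 - np.mean(np.exp(z))) / beta
    v -= lr * grad

print("gradient-descent estimate:", v)
print("closed-form soft value:   ", soft_value_closed_form(q_samples, beta))
```

Both printed values should agree closely, illustrating how a regression objective of this form can stand in for the soft (LogSumExp) value target in the maximum entropy setting.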