Impressive results in natural language processing (NLP) based on the Transformer neural network architecture have inspired researchers to explore viewing offline reinforcement learning (RL) as a generic sequence modeling problem. Recent works based on this paradigm have achieved state-of-the-art results in several of the mostly deterministic offline Atari and D4RL benchmarks. However, because these methods jointly model the states and actions as a single sequence modeling problem, they struggle to disentangle the effects of the policy and world dynamics on the return. Thus, in adversarial or stochastic environments, these methods lead to overly optimistic behavior that can be dangerous in safety-critical systems like autonomous driving. In this work, we propose a method that addresses this optimism bias by explicitly disentangling the policy and world models, which allows us at test time to search for policies that are robust to multiple possible futures in the environment. We demonstrate our method's superior performance on a variety of autonomous driving tasks in simulation.
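To make the test-time search concrete, the following is a minimal sketch, not the paper's exact procedure: it assumes hypothetical `policy_model` and `world_model` interfaces standing in for the disentangled models, proposes several candidate action sequences from the policy model, scores each one under several sampled futures from the world model, and keeps the candidate with the best worst-case return.

```python
import numpy as np

def robust_plan(state, policy_model, world_model,
                n_candidates=16, n_futures=8, horizon=10):
    """Sketch of a pessimistic test-time search over disentangled models.

    `policy_model.sample_action_sequence` and `world_model.rollout` are
    assumed, hypothetical interfaces; parameter names and counts are
    illustrative, not taken from the paper.
    """
    best_actions, best_score = None, -np.inf
    for _ in range(n_candidates):
        # Candidate behavior proposed by the policy model alone.
        actions = policy_model.sample_action_sequence(state, horizon)
        # Evaluate it against several possible futures from the world model.
        returns = [world_model.rollout(state, actions) for _ in range(n_futures)]
        # Pessimistic score: worst-case return over the sampled futures,
        # which counteracts the optimism bias of joint sequence models.
        score = min(returns)
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions  # execute the first action, then replan
```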