On Transforming Reinforcement Learning by Transformer: The Development Trajectory

The Transformer, originally devised for natural language processing, has also achieved significant success in computer vision. Owing to its strong expressive power, researchers are investigating ways to apply transformers to reinforcement learning (RL), and transformer-based models have shown their potential on representative RL benchmarks. In this paper, we collect and dissect recent advances in transforming RL with transformers (transformer-based RL, or TRL) in order to explore its development trajectory and future trends. We group existing developments into two categories, architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation, and autonomous driving. Architecture-enhancement methods consider how to apply the powerful transformer structure to RL problems within the traditional RL framework. They model agents and environments much more precisely than deep RL methods, but they remain limited by the inherent defects of traditional RL algorithms, such as bootstrapping and the "deadly triad". Trajectory-optimization methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework. They can extract policies from static datasets and make full use of the transformer's long-sequence modeling capability. Given these advances, we review extensions and challenges in TRL and discuss proposals for future directions. We hope that this survey provides a detailed introduction to TRL and motivates future research in this rapidly developing field.
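To make the trajectory-optimization view concrete, the following is a minimal sketch of how a logged trajectory might be flattened into a token sequence and turned into supervised next-action targets, in the spirit of behavior cloning on static data. It is an illustration under stated assumptions, not a method taken from any surveyed work: the `(state, action, reward)` tuple layout, the function names, and the use of a return-to-go token for conditioning are all hypothetical choices made here. A sequence model such as a transformer would then be trained to predict each target action from its context.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout of one logged step: (state, action, reward).
Step = Tuple[Sequence[float], int, float]
Token = Tuple[str, object]


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Suffix sums of the rewards: R_t = r_t + r_{t+1} + ... + r_T."""
    out: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    out.reverse()
    return out


def flatten_trajectory(steps: Sequence[Step]) -> List[Token]:
    """Interleave (return-to-go, state, action) tokens into one sequence.

    The return-to-go token is an assumption of this sketch; a plain
    state-action sequence would simply omit it.
    """
    rtg = returns_to_go([reward for _, _, reward in steps])
    tokens: List[Token] = []
    for g, (state, action, _) in zip(rtg, steps):
        tokens.append(("rtg", g))
        tokens.append(("state", tuple(state)))
        tokens.append(("action", action))
    return tokens


def next_action_pairs(tokens: Sequence[Token]) -> List[Tuple[List[Token], object]]:
    """Build (context, target_action) pairs for supervised training.

    Each context holds every token that precedes the target action.
    """
    pairs = []
    for i, (kind, value) in enumerate(tokens):
        if kind == "action":
            pairs.append((list(tokens[:i]), value))
    return pairs


# A toy three-step trajectory from a static dataset (values are made up).
trajectory: List[Step] = [([0.0], 1, 1.0), ([1.0], 0, 0.0), ([2.0], 1, 2.0)]
tokens = flatten_trajectory(trajectory)
pairs = next_action_pairs(tokens)
print(len(tokens), [len(context) for context, _ in pairs])
```

Because every context contains the full preceding history, a model trained on such pairs can attend over the whole trajectory, and no bootstrapped value target appears anywhere in the training signal.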
