Transformer in Transformer as Backbone for Deep Reinforcement Learning

From the arXiv paper: "As far as we know, TIT is the first pure Transformer-based backbone for deep online and offline RL, and it also extends the offline SL paradigm proposed by Decision Transformer."

Designing better deep networks and better reinforcement learning (RL) algorithms are both important for deep RL. This work focuses on the former. Previous methods build the network with several modules like CNN, LSTM and Attention. Recent methods combine the Transformer with these modules for better performance. However, it requires tedious optimization skills to train a network composed of mixed modules, making these methods inconvenient to use in practice. In this paper, we propose to design pure Transformer-based networks for deep RL, aiming at providing off-the-shelf backbones for both the online and offline settings. Specifically, the Transformer in Transformer (TIT) backbone is proposed, which cascades two Transformers in a very natural way: the inner one is used to process a single observation, while the outer one is responsible for processing the observation history; combining both is expected to extract spatial-temporal representations for good decision-making. Experiments show that TIT can achieve satisfactory performance in different settings, consistently.
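The cascade described in the abstract (an inner Transformer per observation, an outer Transformer over the observation history) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The class names `Block` and `TinyTIT`, the class token used to summarize each observation, the single-head attention, the causal mask in the outer block, the random untrained weights, and all sizes are assumptions chosen for illustration. Positional embeddings are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class Block:
    """One pre-norm Transformer block: single-head self-attention plus MLP."""
    def __init__(self, dim):
        s = dim ** -0.5
        self.wq, self.wk, self.wv, self.wo = (
            rng.normal(0, s, (dim, dim)) for _ in range(4))
        self.w1 = rng.normal(0, s, (dim, 4 * dim))
        self.w2 = rng.normal(0, (4 * dim) ** -0.5, (4 * dim, dim))

    def __call__(self, x, causal=False):
        # x has shape (..., tokens, dim); attention runs over the tokens axis.
        h = layer_norm(x)
        q, k, v = h @ self.wq, h @ self.wk, h @ self.wv
        scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(x.shape[-1])
        if causal:
            n = x.shape[-2]
            future = np.triu(np.ones((n, n), dtype=bool), k=1)
            scores = np.where(future, -1e9, scores)  # hide future tokens
        x = x + softmax(scores) @ v @ self.wo
        return x + np.maximum(layer_norm(x) @ self.w1, 0.0) @ self.w2

class TinyTIT:
    """Hypothetical two-level backbone: inner block per observation,
    outer block over the history of observation representations."""
    def __init__(self, patch_dim, dim, n_actions):
        self.embed = rng.normal(0, patch_dim ** -0.5, (patch_dim, dim))
        self.cls = rng.normal(0, 0.02, (1, dim))   # assumed summary token
        self.inner = Block(dim)
        self.outer = Block(dim)
        self.head = rng.normal(0, dim ** -0.5, (dim, n_actions))

    def __call__(self, obs_patches):
        # obs_patches: (T, P, patch_dim) = history length, patches, patch size
        T = obs_patches.shape[0]
        tokens = obs_patches @ self.embed                    # (T, P, dim)
        cls = np.broadcast_to(self.cls, (T, 1, tokens.shape[-1]))
        tokens = np.concatenate([cls, tokens], axis=1)       # (T, P+1, dim)
        obs_repr = self.inner(tokens)[:, 0]                  # (T, dim), spatial
        hist = self.outer(obs_repr, causal=True)             # (T, dim), temporal
        return hist @ self.head                              # (T, n_actions)

model = TinyTIT(patch_dim=16, dim=32, n_actions=4)
obs = rng.normal(size=(5, 9, 16))   # 5 time steps, 9 patches of size 16 each
logits = model(obs)
print(logits.shape)
```

In this sketch the inner block sees each observation independently, so it can only capture spatial structure within one observation. Temporal structure enters only through the outer block, which is where the causal mask keeps a time step from attending to later observations.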
