Reinforcement Learning (RL) can be cast as a sequence modeling task: given a sequence of past state-action-reward experiences, an agent predicts a sequence of future actions. In this work, we propose the State-Action-Reward Transformer (StARformer) for visual RL, which explicitly models short-term state-action-reward representations (StAR-representations), essentially introducing a Markovian-like inductive bias to improve long-term modeling. Our approach first extracts StAR-representations by self-attending over image state patches, action, and reward tokens within a short temporal window. These are then combined with pure image state representations, extracted as convolutional features, to perform self-attention over the whole sequence. Our experiments show that StARformer outperforms the state-of-the-art Transformer-based method on image-based Atari and DeepMind Control Suite benchmarks, in both offline-RL and imitation learning settings. StARformer also handles longer input sequences more effectively. Our code is available at https://github.com/elicassion/StARformer.
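Below is a minimal sketch of the two-level attention idea described above, assuming PyTorch; the module names (StepEncoder, SequenceModel), layer sizes, patch size, and token-interleaving details are illustrative placeholders rather than the paper's exact implementation.

```python
# Hypothetical sketch: short-window StAR-representation extraction followed by
# whole-sequence self-attention. All hyperparameters are illustrative.
import torch
import torch.nn as nn


class StepEncoder(nn.Module):
    """Encodes one (state, action, reward) step into a single StAR token
    by self-attending over image patches plus action and reward tokens."""

    def __init__(self, img_size=84, patch=14, dim=64, n_actions=18):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patchify = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.action_emb = nn.Embedding(n_actions, dim)
        self.reward_emb = nn.Linear(1, dim)
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 2, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, state, action, reward):
        # state: (B, 1, H, W), action: (B,), reward: (B, 1)
        p = self.patchify(state).flatten(2).transpose(1, 2)   # (B, N, dim)
        a = self.action_emb(action).unsqueeze(1)              # (B, 1, dim)
        r = self.reward_emb(reward).unsqueeze(1)              # (B, 1, dim)
        tokens = torch.cat([p, a, r], dim=1) + self.pos
        return self.attn(tokens).mean(dim=1)                  # pooled StAR token


class SequenceModel(nn.Module):
    """Interleaves StAR tokens with convolutional state features and applies
    causal self-attention over the whole trajectory to predict actions."""

    def __init__(self, dim=64, n_actions=18):
        super().__init__()
        self.step_enc = StepEncoder(dim=dim, n_actions=n_actions)
        self.conv_state = nn.Sequential(                      # pure state features
            nn.Conv2d(1, 16, 8, 4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, 2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.seq_attn = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_actions)

    def forward(self, states, actions, rewards):
        # states: (B, T, 1, H, W), actions: (B, T), rewards: (B, T, 1)
        B, T = actions.shape
        star = self.step_enc(states.flatten(0, 1), actions.flatten(),
                             rewards.flatten(0, 1)).view(B, T, -1)
        conv = self.conv_state(states.flatten(0, 1)).view(B, T, -1)
        seq = torch.stack([star, conv], dim=2).flatten(1, 2)  # (B, 2T, dim)
        # causal mask so each token attends only to earlier timesteps
        mask = torch.triu(torch.full((2 * T, 2 * T), float('-inf')), diagonal=1)
        out = self.seq_attn(seq, mask=mask)
        return self.head(out[:, 1::2])                        # action logits per step
```

The key design point this sketch tries to convey is the separation of concerns: short-window attention over patch, action, and reward tokens supplies a local, Markovian-like summary of each step, while the outer causal Transformer models long-range dependencies over the interleaved token sequence.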