Vision Transformers (ViT) have recently demonstrated the significant potential of transformer architectures for computer vision. To what extent can image-based deep reinforcement learning also benefit from ViT architectures, as compared to standard convolutional neural network (CNN) architectures? To answer this question, we evaluate ViT training methods for image-based reinforcement learning (RL) control tasks and compare these results against RAD, a leading method based on a convolutional-network architecture. For training the ViT encoder, we consider several recently proposed self-supervised losses, treated as auxiliary tasks, as well as a baseline with no additional loss terms. We find that the CNN architectures trained using RAD still generally provide superior performance. For the ViT methods, all three types of auxiliary tasks that we consider provide a benefit over plain ViT training. Furthermore, ViT reconstruction-based tasks are found to significantly outperform ViT contrastive learning.
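To make the auxiliary-task setup concrete, below is a minimal PyTorch sketch, not the paper's implementation, of training a ViT encoder with a reconstruction-based auxiliary loss added to an RL objective. The encoder dimensions, the `recon_head` module, the `aux_weight` coefficient, and the stand-in `rl_loss` are all illustrative assumptions.

```python
# Sketch: ViT encoder + reconstruction auxiliary loss alongside an RL loss.
# All hyperparameters and module names here are assumptions for illustration.
import torch
import torch.nn as nn

class ViTEncoder(nn.Module):
    def __init__(self, img_size=84, patch=14, dim=128, depth=4, heads=4):
        super().__init__()
        self.n_patches = (img_size // patch) ** 2
        # Patch embedding: split the frame into non-overlapping patches.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        tokens = self.patchify(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.blocks(tokens + self.pos)                 # per-patch features

encoder = ViTEncoder()
# Reconstruction head maps each patch token back to that patch's raw pixels.
recon_head = nn.Linear(128, 14 * 14 * 3)
params = list(encoder.parameters()) + list(recon_head.parameters())
opt = torch.optim.Adam(params, lr=1e-4)

obs = torch.rand(8, 3, 84, 84)           # a batch of image observations
feats = encoder(obs)                      # (8, 36, 128)

# Auxiliary task: reconstruct the original patches from the encoder features.
target = obs.unfold(2, 14, 14).unfold(3, 14, 14)   # (8, 3, 6, 6, 14, 14)
target = target.permute(0, 2, 3, 1, 4, 5).reshape(8, 36, -1)
aux_loss = nn.functional.mse_loss(recon_head(feats), target)

rl_loss = feats.mean() * 0.0              # stand-in for the actual actor/critic loss
aux_weight = 1.0                          # assumed coefficient balancing the two terms
(rl_loss + aux_weight * aux_loss).backward()
opt.step()
```

A contrastive auxiliary task would replace the reconstruction term with an agreement loss between features of two augmented views of `obs`, while the plain-ViT baseline drops the auxiliary term entirely.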