Although convolutional neural networks (CNNs) have shown remarkable results in many vision tasks, they still struggle with simple yet challenging visual reasoning problems. Inspired by the recent success of the Transformer network in computer vision, in this paper we introduce the Recurrent Vision Transformer (RViT) model. By leveraging recurrent connections and spatial attention for reasoning tasks, this network achieves competitive results on the same-different visual reasoning problems from the SVRT dataset. Weight sharing in both the spatial and depth dimensions regularizes the model, allowing it to learn with far fewer free parameters and only 28k training samples. A comprehensive ablation study confirms the importance of the hybrid CNN + Transformer architecture and the role of the feedback connections, which iteratively refine the internal representation until a stable prediction is obtained. Ultimately, this study lays the basis for a deeper understanding of the role of attention and recurrent connections in solving visual abstract reasoning tasks.
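To make the described architecture concrete, the following is a minimal sketch, assuming PyTorch, of the core idea: a convolutional stem produces a grid of spatial tokens, and a single Transformer layer with shared weights is applied recurrently, acting as a feedback loop that refines the internal representation step by step. All module names, hyperparameters, and the readout scheme here are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the RViT idea (not the authors' code): a hybrid
# CNN + Transformer where one Transformer layer is reused across steps,
# i.e., weight sharing in the depth dimension. All sizes are assumptions.
import torch
import torch.nn as nn

class RecurrentViTSketch(nn.Module):
    def __init__(self, dim=128, heads=4, steps=8, num_classes=2):
        super().__init__()
        # Hybrid front-end: a small convolutional stem extracts spatial tokens.
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )
        # A single Transformer layer applied recurrently: the same weights
        # process the state at every iteration.
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim,
            batch_first=True,
        )
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, num_classes)
        self.steps = steps

    def forward(self, x):
        b = x.size(0)
        # (B, C, H, W) -> (B, H*W, dim): flatten the feature map into tokens.
        tokens = self.stem(x).flatten(2).transpose(1, 2)
        h = torch.cat([self.cls.expand(b, -1, -1), tokens], dim=1)
        # Feedback loop: the shared block iteratively refines the internal
        # representation; the class token is read out after the final step.
        for _ in range(self.steps):
            h = self.block(h)
        return self.head(h[:, 0])
```

In this reading, the recurrence replaces a stack of independent layers, which is what keeps the free-parameter count low relative to a standard ViT of the same effective depth.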