We present a simple yet effective recursive operation on vision transformers that improves parameter utilization without introducing additional parameters. This is achieved by sharing weights across the depth of transformer networks. The proposed method obtains a substantial gain (~2%) from the naive recursive operation alone, requires no sophisticated knowledge of network design principles, and introduces minimal computational overhead to the training procedure. To reduce the additional computation caused by the recursive operation while maintaining accuracy, we propose an approximation via multiple sliced group self-attentions across recursive layers, which reduces the computational cost by 10~30% with minimal performance loss. We call our model Sliced Recursive Transformer (SReT), a novel and parameter-efficient vision transformer design that is compatible with a broad range of other efficient ViT architectures. Our best model establishes significant improvements on ImageNet-1K over state-of-the-art methods while containing fewer parameters. The proposed weight-sharing mechanism with the sliced recursion structure allows us to easily build transformers with more than 100 or even 1,000 shared layers while keeping a compact size (13~15M parameters), avoiding the optimization difficulties that arise when models become too large. This flexible scalability shows great potential for scaling up models and constructing extremely deep vision transformers. Code is available at https://github.com/szq0214/SReT.
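The two core ideas above can be illustrated with a minimal sketch: one attention block whose weights are reused at every recursion step (so depth grows without adding parameters), and a sliced variant that partitions tokens into groups so each self-attention operates on a smaller sequence. This is a toy NumPy illustration of the general mechanism, not the authors' implementation; all names (`SharedBlock`, `recursive_forward`) and the single-head, no-MLP simplification are our own assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedBlock:
    """One attention block whose weights are shared across all recursive steps.

    Illustrative simplification: single head, no MLP or layer norm.
    """
    def __init__(self, dim, rng):
        self.wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def attend(self, x):
        # Standard self-attention over all n tokens: O(n^2) pairwise scores.
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        return softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v + x  # residual

    def sliced_attend(self, x, groups):
        # Sliced group self-attention: split the n tokens into `groups`
        # slices and attend within each slice, cutting the quadratic cost.
        parts = np.array_split(x, groups, axis=0)
        return np.concatenate([self.attend(p) for p in parts], axis=0)

def recursive_forward(block, x, depth, groups=1):
    # The same block (same weights) is applied `depth` times, so the
    # parameter count is independent of the effective depth.
    for _ in range(depth):
        x = block.attend(x) if groups == 1 else block.sliced_attend(x, groups)
    return x
```

A recursion of depth 100 here uses exactly the parameters of one block, which is the property that lets the full model keep a compact size while behaving as a much deeper network.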