Hierarchical structures are popular in recent vision transformers; however, they require sophisticated designs and massive datasets to work well. In this paper, we explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical way. We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication. This observation leads us to design a simplified architecture that requires only minor code changes to the original vision transformer. The benefits of the proposed judiciously-selected design are threefold: (1) NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets like CIFAR; (2) when extending our key ideas to image generation, NesT leads to a strong decoder that is 8$\times$ faster than previous transformer-based generators; and (3) we show that decoupling the feature learning and abstraction processes via this nested hierarchy in our design enables constructing a novel method (named GradCAT) for visually interpreting the learned model. Source code is available at https://github.com/google-research/nested-transformer.
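The core idea above can be illustrated with a toy NumPy sketch: partition a feature map into non-overlapping blocks, run self-attention independently inside each block, then aggregate across blocks so that the next hierarchy level sees a coarser, merged map. This is only an illustrative simplification under stated assumptions: real NesT uses learned query/key/value projections, multi-head attention, and a convolution-plus-pooling aggregation, whereas here attention uses the inputs directly and aggregation is a plain 2x2 max-pool.

```python
import numpy as np

def block_partition(x, block_size):
    """Split an (H, W, C) feature map into non-overlapping blocks.

    Returns an array of shape (num_blocks, block_size**2, C), so each
    block is a short token sequence for a local transformer.
    """
    H, W, C = x.shape
    b = block_size
    x = x.reshape(H // b, b, W // b, b, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, b * b, C)

def local_self_attention(blocks):
    """Toy single-head self-attention applied independently per block.

    Assumption: q = k = v = inputs (no learned projections), to keep the
    sketch dependency-free. Information only flows *within* each block.
    """
    q = k = v = blocks
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(blocks.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

def aggregate_blocks(blocks, H, W, block_size):
    """Un-block back to (H, W, C), then 2x2 max-pool (stand-in for the
    paper's learned aggregation). Pooling straddles block borders, which
    is what enables cross-block, non-local communication at the next
    hierarchy level."""
    b = block_size
    C = blocks.shape[-1]
    x = blocks.reshape(H // b, W // b, b, b, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

# Usage: one hierarchy level on a toy 8x8 feature map with 4 channels.
x = np.random.rand(8, 8, 4)
blocks = block_partition(x, block_size=4)              # (4, 16, 4)
attn = local_self_attention(blocks)                    # (4, 16, 4)
pooled = aggregate_blocks(attn, H=8, W=8, block_size=4)  # (4, 4, 4)
```

Stacking this partition-attend-aggregate step, with the spatial size halving each time, yields the nested hierarchy; the choice of aggregation function at each merge is the component the paper identifies as critical.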