Although hierarchical structures are popular in recent vision transformers, they require sophisticated designs and massive datasets to work well. In this work, we explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical manner. We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication. This observation leads us to design a simplified architecture that requires only minor code changes to the original vision transformer and obtains improved performance compared to existing methods. Our empirical results show that the proposed method NesT converges faster and requires much less training data to achieve good generalization. For example, a NesT with 68M parameters trained on ImageNet for 100/300 epochs achieves $82.3\%/83.8\%$ accuracy evaluated on $224\times 224$ image size, outperforming previous methods with up to $57\%$ parameter reduction. Training a NesT with 6M parameters from scratch on CIFAR10 achieves $96\%$ accuracy using a single GPU, setting a new state of the art for vision transformers. Beyond image classification, we extend the key idea to image generation and show that NesT leads to a strong decoder that is 8$\times$ faster than previous transformer-based generators. Furthermore, we also propose a novel method for visually interpreting the learned model.
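The core mechanism — partitioning a feature map into non-overlapping blocks, running a local transformer per block, then spatially aggregating so the next level's blocks cover wider regions — can be sketched in a few NumPy array operations. This is a minimal illustrative sketch, not the paper's implementation: the per-block transformer layers are elided, and the `aggregate` step here is a simple 2×2 max-pooling stand-in for the paper's learned convolution-plus-pooling aggregation; the function names (`blockify`, `deblockify`, `aggregate`) are hypothetical.

```python
import numpy as np

def blockify(x, block_size):
    """Split an (H, W, C) feature map into non-overlapping blocks.
    Returns (num_blocks, block_size*block_size, C) token sequences,
    one independent sequence per block for the local transformer."""
    H, W, C = x.shape
    b = block_size
    x = x.reshape(H // b, b, W // b, b, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (H/b, W/b, b, b, C)
    return x.reshape(-1, b * b, C)

def deblockify(blocks, grid, block_size):
    """Inverse of blockify: (num_blocks, b*b, C) -> (H, W, C)."""
    gh, gw = grid
    b = block_size
    x = blocks.reshape(gh, gw, b, b, -1).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * b, gw * b, -1)

def aggregate(x):
    """Hypothetical block-aggregation step: 2x2 spatial max-pooling,
    halving each spatial dimension so each block at the next level
    covers a wider image region (stand-in for conv + pooling)."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

# Toy 3-level hierarchy over a 32x32 "image" with 8x8 blocks.
x = np.random.rand(32, 32, 3)
for level in range(3):
    b = 8
    tokens = blockify(x, b)                  # independent local sequences
    # ... per-block transformer layers would run here ...
    x = deblockify(tokens, (x.shape[0] // b, x.shape[1] // b), b)
    if level < 2:
        x = aggregate(x)                     # cross-block communication
print(x.shape)
```

At the top level the whole (downsampled) map fits in a single block, so self-attention there is global — this is where cross-block information fully mixes.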