Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Transformers, which are popular for language modeling, have recently been explored for vision tasks, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a fixed-length sequence of tokens and then applies multiple Transformer layers to model their global relations for classification. However, ViT performs worse than CNNs when trained from scratch on a midsize dataset like ImageNet. We find two reasons for this: 1) the simple tokenization of input images fails to model important local structure, such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness under fixed computation budgets and limited training samples. To overcome these limitations, we propose a new Tokens-to-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token, so that the local structure represented by surrounding tokens can be modeled and the token length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after an empirical study. Notably, T2T-ViT halves the parameter count and MACs of vanilla ViT while achieving more than 3.0% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves performance comparable to MobileNets when trained directly on ImageNet. For example, T2T-ViT with a size comparable to ResNet50 (21.5M parameters) can achieve 83.3% top-1 accuracy at an image resolution of 384×384 on ImageNet. (Code: https://github.com/yitu-opensource/T2T-ViT)
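
To make the token-aggregation idea concrete, the following is a minimal pure-Python sketch of one aggregation step, not the authors' implementation. It arranges tokens on a 2D grid and concatenates each overlapping neighborhood into a single longer token, so the token count shrinks at every step. The function names `soft_split` and `reshape_to_grid`, the zero padding, and the kernel, stride, and padding values are assumptions chosen for illustration; the abstract does not specify them, and any Transformer layers applied between steps in the actual model are omitted here.

```python
def soft_split(grid, kernel, stride, padding):
    """Merge each kernel x kernel neighborhood of tokens into one token.

    grid: H x W nested list of tokens; each token is a list of C floats.
    Returns (tokens, out_h, out_w), where every output token is the
    concatenation of its neighbors and has length C * kernel * kernel.
    Zero padding at the border is an assumption of this sketch.
    """
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    zero = [0.0] * c  # stand-in token for positions outside the grid
    out_h = (h + 2 * padding - kernel) // stride + 1
    out_w = (w + 2 * padding - kernel) // stride + 1
    tokens = []
    for i in range(out_h):
        for j in range(out_w):
            merged = []
            for di in range(kernel):
                for dj in range(kernel):
                    y = i * stride - padding + di
                    x = j * stride - padding + dj
                    inside = 0 <= y < h and 0 <= x < w
                    merged.extend(grid[y][x] if inside else zero)
            tokens.append(merged)
    return tokens, out_h, out_w


def reshape_to_grid(tokens, h, w):
    """Re-arrange a flat token sequence back into an h x w grid."""
    return [tokens[r * w:(r + 1) * w] for r in range(h)]


# Hypothetical 8 x 8 grid of one-dimensional tokens (64 tokens in total).
grid = [[[float(r * 8 + c)] for c in range(8)] for r in range(8)]

# First step: 64 tokens of length 1 become 16 tokens of length 9.
tokens, h, w = soft_split(grid, kernel=3, stride=2, padding=1)
print(len(tokens), len(tokens[0]))

# Second step, applied recursively: 16 tokens become 4 tokens of length 81.
tokens2, h2, w2 = soft_split(reshape_to_grid(tokens, h, w), 3, 2, 1)
print(len(tokens2), len(tokens2[0]))
```

Because neighboring windows overlap (the stride is smaller than the kernel), each output token carries information from the tokens surrounding it, which is the local structure the abstract refers to. The token dimension grows with every concatenation in this sketch; a practical model would project it back down between steps.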
