The recently proposed Vision Transformers (ViT), built on pure attention, have achieved promising performance on image recognition tasks such as image classification. However, current ViT models maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation. To this end, we propose a Hierarchical Visual Transformer (HVT) that progressively pools visual tokens to shrink the sequence length and hence reduce the computational cost, analogous to feature map downsampling in Convolutional Neural Networks (CNNs). This brings a notable benefit: thanks to the reduced sequence length, we can increase model capacity by scaling the depth, width, resolution, or patch size without introducing extra computational complexity. Moreover, we empirically find that the average-pooled visual tokens contain more discriminative information than the single class token. To demonstrate the improved scalability of our HVT, we conduct extensive experiments on the image classification task. With comparable FLOPs, our HVT outperforms competitive baselines on the ImageNet and CIFAR-100 datasets. Code is available at https://github.com/MonashAI/HVT
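To make the two key ideas concrete, the following is a minimal PyTorch sketch, not the authors' implementation (see the repository above for that): each stage runs a few transformer blocks and then applies 1D max pooling along the token axis to halve the sequence length, mirroring CNN feature map downsampling, and the classifier reads the average-pooled tokens rather than a single class token. The `PoolingStage` and `TinyHVT` names, the stage count and depths, the embedding dimension, and the pooling kernel are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoolingStage(nn.Module):
    """One hypothetical stage: a few transformer blocks followed by
    1D max pooling that roughly halves the token sequence length."""
    def __init__(self, dim, depth, num_heads=8):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                       batch_first=True)
            for _ in range(depth)
        ])
        # Pooling over the token axis is analogous to spatial
        # downsampling of feature maps in CNNs.
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):              # x: (batch, num_tokens, dim)
        for blk in self.blocks:
            x = blk(x)
        # MaxPool1d expects (batch, channels, length), so pool over tokens.
        return self.pool(x.transpose(1, 2)).transpose(1, 2)

class TinyHVT(nn.Module):
    """Toy hierarchical model: three pooling stages, then average pooling
    over the remaining tokens for classification (no class token)."""
    def __init__(self, dim=192, num_classes=100):
        super().__init__()
        self.stages = nn.ModuleList(
            [PoolingStage(dim, depth=2) for _ in range(3)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):              # x: (batch, num_tokens, dim)
        for stage in self.stages:
            x = stage(x)               # sequence length shrinks per stage
        return self.head(x.mean(dim=1))  # average-pooled tokens

tokens = torch.randn(2, 196, 192)      # e.g. 14x14 patch embeddings
logits = TinyHVT()(tokens)
print(logits.shape)                    # torch.Size([2, 100])
```

In this sketch the 196 input tokens are reduced to 98, 49, and then 25 across the three stages, so later blocks attend over far shorter sequences; that saved computation is what allows scaling depth, width, resolution, or patch size at comparable FLOPs.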