The recently proposed Vision Transformers (ViT), built on pure attention, have achieved promising performance on image recognition tasks such as image classification. However, current ViT models maintain a full-length patch sequence throughout inference, which is redundant and lacks hierarchical representation. To this end, we propose a Hierarchical Visual Transformer (HVT) that progressively pools visual tokens to shrink the sequence length and hence reduce the computational cost, analogous to feature map downsampling in Convolutional Neural Networks (CNNs). A key benefit of the reduced sequence length is that we can increase model capacity by scaling the depth/width/resolution/patch-size dimensions without introducing extra computational complexity. Moreover, we empirically find that the average-pooled visual tokens contain more discriminative information than a single class token. To demonstrate the improved scalability of our HVT, we conduct extensive experiments on the image classification task. With comparable FLOPs, our HVT outperforms competitive baselines on the ImageNet and CIFAR-100 datasets.
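To make the pooling idea concrete, below is a minimal PyTorch-style sketch (not the authors' released code): a plain ViT backbone whose token sequence is max-pooled between stages, and whose classifier reads the average-pooled visual tokens rather than a class token. The class names (`HierarchicalViT`, `PooledTransformerStage`), stage depths, and pooling hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class PooledTransformerStage(nn.Module):
    """One stage of a hierarchical ViT: several transformer blocks followed by
    1D pooling that roughly halves the token sequence length (illustrative)."""

    def __init__(self, dim, depth, num_heads, pool=True):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Max pooling along the token axis, analogous to CNN feature map downsampling.
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1) if pool else None

    def forward(self, x):                  # x: (batch, num_tokens, dim)
        x = self.blocks(x)
        if self.pool is not None:
            x = self.pool(x.transpose(1, 2)).transpose(1, 2)  # shrink num_tokens
        return x


class HierarchicalViT(nn.Module):
    """Minimal hierarchical ViT: patch embedding, pooled stages, and a head that
    classifies from the average-pooled visual tokens (no class token)."""

    def __init__(self, img_size=224, patch_size=16, dim=384, num_heads=6,
                 depths=(4, 4, 4), num_classes=1000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        # Pool after every stage except the last, so the sequence length shrinks progressively.
        self.stages = nn.ModuleList([
            PooledTransformerStage(dim, d, num_heads, pool=(i < len(depths) - 1))
            for i, d in enumerate(depths)])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):             # images: (batch, 3, H, W)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        x = x + self.pos_embed
        for stage in self.stages:
            x = stage(x)                    # fewer tokens after each pooled stage
        x = self.norm(x).mean(dim=1)        # average-pool visual tokens for the classifier
        return self.head(x)


if __name__ == "__main__":
    model = HierarchicalViT()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)                     # torch.Size([2, 1000])
```

Because later stages operate on a shorter token sequence, their attention cost drops, which is what leaves room to scale depth, width, resolution, or patch size at comparable FLOPs.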