Vision Transformers (ViTs) have emerged with superior performance on computer vision tasks compared to convolutional neural network (CNN)-based models. However, ViTs are mainly designed for image classification and generate single-scale low-resolution representations, which makes dense prediction tasks such as semantic segmentation challenging for ViTs. Therefore, we propose HRViT, which enhances ViTs to learn semantically rich and spatially precise multi-scale representations by integrating high-resolution multi-branch architectures with ViTs. We balance the performance and efficiency of HRViT through several branch-block co-optimization techniques. Specifically, we explore heterogeneous branch designs, reduce the redundancy in linear layers, and augment the attention block with enhanced expressiveness. These approaches enable HRViT to push the Pareto frontier of performance and efficiency on semantic segmentation to a new level, as our evaluation results on ADE20K and Cityscapes show. HRViT achieves 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes, surpassing state-of-the-art MiT and CSWin backbones with an average +1.78 mIoU improvement, 28% parameter saving, and 21% FLOPs reduction, demonstrating the potential of HRViT as a strong vision backbone for semantic segmentation.
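To make the high-resolution multi-branch idea concrete, below is a minimal, illustrative sketch of a two-branch stage in PyTorch: each branch runs a transformer block at its own spatial scale, and a cross-resolution fusion step exchanges resized features between branches. This is an assumption-laden toy, not the paper's implementation: the names `TwoBranchStage` and `BranchBlock` are hypothetical, plain multi-head self-attention stands in for HRViT's augmented attention, and the fusion here is a simple resize-project-add rather than the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchBlock(nn.Module):
    """One transformer block on a single-resolution branch.
    Plain MHSA is used as a stand-in for HRViT's enhanced attention."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim)
        )

    def forward(self, x):  # x: (B, H*W, C)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

class TwoBranchStage(nn.Module):
    """Toy high-resolution stage: two parallel branches at different
    scales, followed by cross-resolution fusion (resize + 1x1 conv + add)."""
    def __init__(self, dims=(32, 64), num_heads=(1, 2)):
        super().__init__()
        self.blocks = nn.ModuleList(
            BranchBlock(d, h) for d, h in zip(dims, num_heads)
        )
        # 1x1 projections to match channel counts between branches
        self.hi_to_lo = nn.Conv2d(dims[0], dims[1], 1)
        self.lo_to_hi = nn.Conv2d(dims[1], dims[0], 1)

    def forward(self, x_hi, x_lo):  # NCHW maps; x_hi is 2x the resolution of x_lo
        feats = []
        for blk, x in zip(self.blocks, (x_hi, x_lo)):
            b, c, h, w = x.shape
            t = blk(x.flatten(2).transpose(1, 2))  # tokens -> transformer block
            feats.append(t.transpose(1, 2).reshape(b, c, h, w))
        y_hi, y_lo = feats
        # cross-resolution fusion: each branch receives resized features
        # from the other, so high-res stays spatially precise while
        # absorbing the low-res branch's semantics, and vice versa
        fused_hi = y_hi + F.interpolate(
            self.lo_to_hi(y_lo), size=y_hi.shape[-2:],
            mode="bilinear", align_corners=False,
        )
        fused_lo = y_lo + self.hi_to_lo(F.avg_pool2d(y_hi, 2))
        return fused_hi, fused_lo

# usage: feature maps at two scales (e.g. 1/4 and 1/8 of a 64x64 input)
stage = TwoBranchStage()
hi, lo = stage(torch.randn(1, 32, 16, 16), torch.randn(1, 64, 8, 8))
print(hi.shape, lo.shape)  # (1, 32, 16, 16) (1, 64, 8, 8)
```

Because every branch keeps its resolution throughout the stage, the segmentation head can read multi-scale features directly instead of recovering them from a single low-resolution output, which is the property the abstract argues classification-oriented ViTs lack.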