Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g., TensorRT and CoreML. This poses a distinct challenge: can a visual neural network be designed to infer as fast as CNNs and perform as powerfully as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet the overall performance of these works is far from satisfactory. To this end, we propose a next-generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of the latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are developed to capture local and global information, respectively, with deployment-friendly mechanisms. Then, the Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance on various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs, and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation under similar latency. Meanwhile, it achieves comparable performance to CSWin, while the inference speed is accelerated by 3.6x. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation under similar latency. Our code and models are made public at: https://github.com/bytedance/Next-ViT