The recent amalgamation of transformer and convolutional designs has led to steady improvements in the accuracy and efficiency of models. In this work, we introduce FastViT, a hybrid vision transformer architecture that achieves a state-of-the-art latency-accuracy trade-off. To this end, we introduce RepMixer, a novel token-mixing operator and a basic building block of FastViT, which uses structural reparameterization to lower memory access cost by removing skip connections from the network. We further apply train-time overparametrization and large-kernel convolutions to boost accuracy, and empirically show that these choices have minimal effect on latency. We show that our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device at the same accuracy on the ImageNet dataset. At similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks (image classification, detection, segmentation, and 3D mesh regression) with significant improvements in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models.
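For intuition, here is a minimal, self-contained sketch of the structural-reparameterization idea behind RepMixer: a residual depthwise token-mixing block is trained with a skip connection, then folded into a single depthwise convolution for inference, removing the skip (and its memory access cost) entirely. This sketch is not the paper's implementation: it uses the standard RepVGG-style conv-then-BN ordering so the fold is exact, whereas the paper states the train-time form Y = DWConv(BN(X)) + X; the class name `RepMixerSketch` and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn


class RepMixerSketch(nn.Module):
    """Sketch of a RepMixer-style reparameterizable token mixer.

    Train time:   y = x + BN(DWConv(x))   -- skip connection present
    After fusing: y = DWConv'(x)          -- single depthwise conv, skip removed
    """

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.dim, self.k = dim, kernel_size
        # Depthwise conv mixes tokens spatially, one channel at a time.
        self.conv = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2,
                              groups=dim, bias=False)
        self.bn = nn.BatchNorm2d(dim)
        self.fused = None  # populated by reparameterize()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.fused is not None:
            return self.fused(x)          # inference path: no residual add
        return x + self.bn(self.conv(x))  # train path: residual add

    @torch.no_grad()
    def reparameterize(self) -> None:
        # Fold BN into the conv (standard conv->BN fusion): BN is a
        # per-channel affine map y = scale * y_conv + shift in eval mode.
        std = torch.sqrt(self.bn.running_var + self.bn.eps)
        scale = self.bn.weight / std
        w = self.conv.weight * scale.reshape(-1, 1, 1, 1)  # [dim, 1, k, k]
        b = self.bn.bias - self.bn.running_mean * scale

        # Fold the identity skip into the kernel: +1 at each kernel center,
        # since convolving with a center-only kernel reproduces x exactly.
        w[:, 0, self.k // 2, self.k // 2] += 1.0

        self.fused = nn.Conv2d(self.dim, self.dim, self.k,
                               padding=self.k // 2, groups=self.dim, bias=True)
        self.fused.weight.copy_(w)
        self.fused.bias.copy_(b)


if __name__ == "__main__":
    # Quick equivalence check: outputs match before and after fusion.
    m = RepMixerSketch(8).eval()
    x = torch.randn(1, 8, 14, 14)
    y_train = m(x)
    m.reparameterize()
    assert torch.allclose(y_train, m(x), atol=1e-5)
```

Because every operation in the train-time branch is linear (no activation between the conv, the BN, and the skip), the fold is lossless: the inference-time block computes the identical function with one operator and no extra memory traffic for the residual.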