We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks and apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show that they are suitable for most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy trade-off. For example, at 80\% ImageNet top-1 accuracy, LeViT is 3.3 times faster than EfficientNet on the CPU.
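To make the "attention bias" mentioned above concrete, the sketch below shows one common way such a mechanism can be realized: a learned, per-head bias table indexed by the relative position of each query/key pair is added to the attention logits before the softmax, so positional information enters the attention map directly rather than through explicit positional embeddings. This is a minimal illustrative sketch, not the authors' implementation; the module name `BiasedAttention`, the `resolution` parameter, and the relative-offset indexing scheme are assumptions made for this example.

```python
# Minimal sketch (illustrative, not the authors' code) of adding a learned
# per-head positional bias to attention logits.
import torch
import torch.nn as nn


class BiasedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, resolution: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        # One learnable bias per head and per relative (dy, dx) offset
        # on a resolution x resolution token grid (assumed layout).
        num_offsets = (2 * resolution - 1) ** 2
        self.attention_biases = nn.Parameter(torch.zeros(num_heads, num_offsets))

        # Precompute, for every query/key pair, the index of its relative
        # offset into the bias table.
        coords = torch.stack(torch.meshgrid(
            torch.arange(resolution), torch.arange(resolution), indexing="ij"
        )).flatten(1)                                   # (2, N)
        rel = coords[:, :, None] - coords[:, None, :]   # (2, N, N)
        rel += resolution - 1                           # shift to non-negative
        idx = rel[0] * (2 * resolution - 1) + rel[1]    # (N, N)
        self.register_buffer("bias_idx", idx)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each (B, H, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, H, N, N)
        attn = attn + self.attention_biases[:, self.bias_idx]  # add positional bias
        attn = attn.softmax(dim=-1)
        return self.proj((attn @ v).transpose(1, 2).reshape(B, N, C))
```

Because the bias depends only on relative positions, it requires no per-token positional embedding and adds negligible cost at inference time.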