We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. We release the code at https://github.com/facebookresearch/LeViT
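To make the attention-bias idea concrete, here is a minimal NumPy sketch of one plausible form of a learned, translation-invariant positional bias added to attention scores. The function names, shapes, and the use of absolute relative offsets are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import numpy as np

def make_bias_index(resolution):
    """Map each (query, key) position pair on a resolution x resolution grid
    to an index over the unique relative offsets between the two positions."""
    points = [(i, j) for i in range(resolution) for j in range(resolution)]
    offsets, idxs = {}, []
    for p1 in points:
        row = []
        for p2 in points:
            # Translation-invariant relative offset between the two pixels
            off = (abs(p1[0] - p2[0]), abs(p1[1] - p2[1]))
            row.append(offsets.setdefault(off, len(offsets)))
        idxs.append(row)
    # idxs: (N, N) index table, N = resolution**2
    return np.array(idxs), len(offsets)

def add_attention_bias(attn, bias_table, idxs):
    """Add a learned per-head bias to raw attention scores.

    attn:       (heads, N, N) attention logits
    bias_table: (heads, num_offsets) learned parameters, one scalar
                per head and per unique relative offset
    """
    # Gathering with the (N, N) index table yields a (heads, N, N) bias
    return attn + bias_table[:, idxs]
```

Because the bias depends only on the relative offset between query and key positions, the number of learned parameters grows with the number of distinct offsets rather than with N squared, and no explicit positional embedding needs to be added to the input tokens.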