Since being introduced in 2020, Vision Transformers (ViTs) have been steadily breaking records on many vision tasks and are often described as ``all you need'' to replace ConvNets. Despite this, ViTs are generally computationally expensive, memory-consuming, and unfriendly to embedded devices. In addition, recent research shows that a standard ConvNet, if redesigned and trained appropriately, can compete favorably with ViTs in terms of accuracy and scalability. In this paper, we adopt the modernized ConvNet structure to design a new backbone for action recognition. In particular, our main target is industrial product deployment, such as FPGA boards on which only standard operations are supported. Therefore, our network consists solely of 2D convolutions, without any 3D convolution, long-range attention plugin, or Transformer blocks. While being trained with far fewer epochs (5x-10x), our backbone surpasses methods using (2+1)D and 3D convolutions, and achieves results comparable to ViTs on two benchmark datasets.
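To make the ``2D convolutions only'' claim concrete, the sketch below shows one way such a frame-wise backbone can be organized: a ConvNeXt-style depthwise/pointwise block applied to each frame independently, with temporal average pooling at the end. The block design, layer sizes, and pooling strategy are illustrative assumptions, not the exact architecture described in this paper.

```python
# Minimal illustrative sketch (assumptions, not the paper's exact backbone):
# a modernized 2D block applied per frame, followed by temporal average pooling.
import torch
import torch.nn as nn

class Conv2DBlock(nn.Module):
    """2D block: depthwise conv -> LayerNorm -> pointwise MLP, with residual."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                      # x: (N, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)              # (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return x + residual

class FrameWiseBackbone(nn.Module):
    """Applies standard 2D ops to each frame, then pools over space and time."""
    def __init__(self, in_ch=3, dim=64, depth=2, num_classes=400):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, dim, kernel_size=4, stride=4)
        self.blocks = nn.Sequential(*[Conv2DBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                  # video: (N, T, C, H, W)
        n, t, c, h, w = video.shape
        x = video.reshape(n * t, c, h, w)      # fold time into batch: pure 2D ops
        x = self.blocks(self.stem(x))
        x = x.mean(dim=(2, 3))                 # spatial global average pooling
        x = x.reshape(n, t, -1).mean(dim=1)    # temporal average pooling
        return self.head(x)

clip = torch.randn(2, 8, 3, 224, 224)          # 2 clips, 8 frames each
logits = FrameWiseBackbone()(clip)             # (2, 400)
```

Because time is folded into the batch dimension, every operation is a standard 2D convolution or elementwise layer, which is the property that makes such a design friendly to FPGA deployment.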