Human visual recognition is a sparse process: only a few salient visual cues are attended to, rather than every detail being traversed uniformly. However, most current vision networks follow a dense paradigm, processing every visual unit (e.g., pixel or patch) in a uniform manner. In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate human sparse visual recognition in an end-to-end manner. SparseFormer learns to represent images with a highly limited number of tokens (down to 49) in the latent space via a sparse feature sampling procedure, instead of processing dense units in the original pixel space. SparseFormer therefore circumvents most dense operations in image space and has much lower computational cost. Experiments on the ImageNet classification benchmark show that SparseFormer achieves performance on par with canonical or well-established models while offering a better accuracy-throughput tradeoff. Moreover, our network design extends easily to video classification, with promising performance at lower computational cost. We hope that our work provides an alternative way for visual modeling and inspires further research on sparse neural architectures. The code will be publicly available at https://github.com/showlab/sparseformer
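To make the sparse feature sampling idea concrete, below is a minimal PyTorch sketch of a single sampling step, written under our own assumptions rather than taken from the official implementation: a fixed set of latent tokens, each owning a learnable region of interest (RoI), bilinearly samples a handful of points from the image and folds them back into its embedding. All names (SparseSampler, offset_head, fuse) and hyperparameters are illustrative; see the repository above for the actual code.

```python
# Minimal sketch (illustrative, not the official SparseFormer code):
# a small set of latent tokens sparsely samples image features instead of
# processing a dense grid of pixels or patches.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseSampler(nn.Module):
    def __init__(self, dim=256, num_tokens=49, num_points=16, in_chans=3):
        super().__init__()
        self.num_points = num_points
        # latent token embeddings, one per token
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        # per-token RoI (cx, cy, w, h), stored pre-sigmoid so it stays in (0, 1)
        self.roi = nn.Parameter(torch.zeros(num_tokens, 4))
        # sampling-point offsets within each RoI, predicted from the token itself
        self.offset_head = nn.Linear(dim, num_points * 2)
        # fold the sampled point features back into the token embedding
        self.fuse = nn.Linear(num_points * in_chans, dim)

    def forward(self, images):                                # (B, C, H, W)
        B = images.shape[0]
        N, P = self.tokens.shape[0], self.num_points
        roi = self.roi.sigmoid()                              # (N, 4) in (0, 1)
        center, size = roi[:, :2], roi[:, 2:]
        # offsets in (-0.5, 0.5), scaled by RoI size, placed around the RoI center
        offsets = self.offset_head(self.tokens).view(N, P, 2).tanh() * 0.5
        coords = center[:, None, :] + offsets * size[:, None, :]   # (N, P, 2)
        grid = (coords * 2 - 1).unsqueeze(0).expand(B, -1, -1, -1) # [-1, 1] range
        # bilinear sampling: only N * P points are read, not the full dense grid
        sampled = F.grid_sample(images, grid, align_corners=False) # (B, C, N, P)
        sampled = sampled.permute(0, 2, 3, 1).reshape(B, N, -1)    # (B, N, P*C)
        return self.tokens + self.fuse(sampled)                    # (B, N, dim)

if __name__ == "__main__":
    block = SparseSampler()
    out = block(torch.randn(2, 3, 224, 224))
    print(out.shape)  # torch.Size([2, 49, 256])
```

A full model would stack such sampling steps with standard transformer blocks operating on the 49 latent tokens, so that all heavy computation stays in the small latent space rather than the dense pixel space.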