Glance and Focus Networks for Dynamic Visual Recognition

From arXiv. Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI). Journal version of arXiv:2010.05300 (NeurIPS 2020). The first two authors contributed equally.

Spatial redundancy widely exists in visual recognition tasks, i.e., discriminative features in an image or video frame usually correspond to only a subset of pixels, while the remaining regions are irrelevant to the task at hand. Static models that process all pixels with an equal amount of computation therefore incur considerable redundancy in time and space consumption. In this paper, we formulate the image recognition problem as a sequential coarse-to-fine feature learning process, mimicking the human visual system. Specifically, the proposed Glance and Focus Network (GFNet) first extracts a quick global representation of the input image at a low-resolution scale, and then strategically attends to a series of salient (small) regions to learn finer features. The sequential process naturally facilitates adaptive inference at test time, as it can be terminated once the model is sufficiently confident about its prediction, avoiding further redundant computation. It is worth noting that the problem of locating discriminative regions in our model is formulated as a reinforcement learning task, thus requiring no additional manual annotations other than classification labels. GFNet is general and flexible as it is compatible with any off-the-shelf backbone model (such as MobileNets, EfficientNets and TSM), which can be conveniently deployed as the feature extractor. Extensive experiments on a variety of image classification and video recognition tasks and with various backbone models demonstrate the remarkable efficiency of our method. For example, it reduces the average latency of the highly efficient MobileNet-V3 on an iPhone XS Max by 1.3x without sacrificing accuracy. Code and pre-trained models are available at https://github.com/blackfeather-wang/GFNet-Pytorch.
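
The glance-then-focus loop with early termination described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation (see the repository linked above for that). Everything named here is a hypothetical stand-in: `toy_classify` replaces the backbone network, `toy_propose_patch` is a hand-written brightness heuristic in place of the learned reinforcement-learning policy, summing logits across steps is a simplification of however the model aggregates information, and the threshold of 0.9 is arbitrary. The sketch also assumes a square image whose side is divisible by `patch_size`.

```python
import math
from typing import Callable, List, Tuple

Image = List[List[float]]  # grayscale image as nested lists, values in [0, 1]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean(view: Image) -> float:
    return sum(sum(row) for row in view) / (len(view) * len(view[0]))


def downsample(img: Image, factor: int) -> Image:
    """Average-pool the image by `factor` to obtain the low-resolution glance."""
    return [
        [sum(img[y + dy][x + dx] for dy in range(factor) for dx in range(factor)) / factor ** 2
         for x in range(0, len(img[0]) - factor + 1, factor)]
        for y in range(0, len(img) - factor + 1, factor)
    ]


def crop(img: Image, top: int, left: int, size: int) -> Image:
    return [row[left:left + size] for row in img[top:top + size]]


def toy_classify(view: Image) -> List[float]:
    """Hypothetical two-class 'backbone': class 1 means 'bright', class 0 'dark'."""
    m = mean(view)
    return [8.0 * (0.5 - m), 8.0 * (m - 0.5)]


def toy_propose_patch(img: Image, patch_size: int, step: int) -> Tuple[int, int]:
    """Hypothetical stand-in for the learned policy: visit grid patches, brightest first."""
    corners = [(t, l)
               for t in range(0, len(img) - patch_size + 1, patch_size)
               for l in range(0, len(img[0]) - patch_size + 1, patch_size)]
    ranked = sorted(corners, key=lambda c: mean(crop(img, c[0], c[1], patch_size)), reverse=True)
    return ranked[(step - 1) % len(ranked)]


def glance_and_focus(img: Image,
                     classify: Callable[[Image], List[float]],
                     propose_patch: Callable[[Image, int, int], Tuple[int, int]],
                     patch_size: int, max_steps: int, threshold: float) -> Tuple[int, int]:
    """Return (predicted class, number of steps used)."""
    view = downsample(img, len(img) // patch_size)  # glance: whole image, low resolution
    total = [0.0] * len(classify(view))
    for step in range(1, max_steps + 1):
        # Simplification: accumulate evidence by summing logits across steps.
        total = [a + b for a, b in zip(total, classify(view))]
        probs = softmax(total)
        confidence = max(probs)
        if confidence >= threshold or step == max_steps:
            return probs.index(confidence), step  # early exit once confident enough
        top, left = propose_patch(img, patch_size, step)
        view = crop(img, top, left, patch_size)  # focus: small full-resolution patch
    raise ValueError("max_steps must be at least 1")


# An uniformly bright image is decided at the glance step.
easy = [[1.0] * 8 for _ in range(8)]
# A mid-gray image with one bright and one dark quadrant is ambiguous at the glance.
hard = [[0.5] * 8 for _ in range(8)]
for y in range(4):
    for x in range(4):
        hard[y][x] = 1.0
        hard[y + 4][x + 4] = 0.0

print(glance_and_focus(easy, toy_classify, toy_propose_patch, 4, 3, 0.9))  # (1, 1)
print(glance_and_focus(hard, toy_classify, toy_propose_patch, 4, 3, 0.9))  # (1, 2)
```

The first call stops after the glance because the low-resolution view is already decisive, while the second needs one focus step. This is the source of the adaptive-inference saving: the number of processed views, and hence the computation, depends on the input.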
