Existing action recognition methods typically sample a few frames to represent each video in order to avoid enormous computation cost, which often limits recognition performance. To tackle this problem, we propose the Ample and Focal Network (AFNet), which is composed of two branches so as to utilize more frames with less computation. Specifically, the Ample Branch takes all input frames to obtain abundant information with condensed computation and provides guidance for the Focal Branch via the proposed Navigation Module; the Focal Branch compresses the temporal dimension to focus only on the salient frames at each convolution block; in the end, the results of the two branches are adaptively fused to prevent the loss of information. With this design, we can feed more frames to the network at less computational cost. Moreover, we demonstrate that AFNet can achieve higher accuracy with fewer frames, as the dynamic selection in intermediate features enforces implicit temporal modeling. Further, we show that our method can be extended to reduce spatial redundancy at even lower cost. Extensive experiments on five datasets demonstrate the effectiveness and efficiency of our method.
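To make the two-branch design concrete, below is a minimal PyTorch sketch of one AFNet-style block. All names and sizes here (`AFBlock`, `ratio`, `keep_frames`, the linear navigation head, the scalar fusion gate) are illustrative assumptions rather than the paper's actual implementation, and the hard top-k frame selection merely stands in for the Navigation Module, which in practice would need a differentiable relaxation (e.g., Gumbel-Softmax) to be trained end-to-end.

```python
import torch
import torch.nn as nn

class AFBlock(nn.Module):
    """Illustrative two-branch block: a cheap 'ample' pass over all frames
    guides a full-width 'focal' pass over a few salient frames.
    Layer sizes, the top-k selection, and the fusion gate are assumptions
    for illustration only, not the paper's exact design."""

    def __init__(self, in_ch, out_ch, ratio=2, keep_frames=4):
        super().__init__()
        self.keep_frames = keep_frames  # must be <= number of input frames
        # Ample branch: processes all frames at reduced channel width.
        self.ample = nn.Conv2d(in_ch, out_ch // ratio, 3, padding=1)
        # Navigation head: scores each frame's saliency from ample features.
        # NOTE: hard top-k below is not differentiable w.r.t. these scores;
        # a soft relaxation would be used during real training.
        self.navigate = nn.Linear(out_ch // ratio, 1)
        # Focal branch: full channel width, but only on the selected frames.
        self.focal = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.proj = nn.Conv2d(out_ch // ratio, out_ch, 1)  # match widths for fusion
        self.gate = nn.Parameter(torch.zeros(1))           # learned fusion weight

    def forward(self, x):  # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        flat = x.reshape(B * T, C, H, W)
        a = self.ample(flat)                                # cheap pass over all T frames
        # Per-frame saliency scores from globally pooled ample features.
        scores = self.navigate(a.mean(dim=(2, 3)).reshape(B, T, -1)).squeeze(-1)
        idx = scores.topk(self.keep_frames, dim=1).indices  # salient frames per clip
        gather_idx = idx[..., None, None, None]
        sel = torch.gather(x, 1, gather_idx.expand(-1, -1, C, H, W))
        f = self.focal(sel.reshape(B * self.keep_frames, C, H, W))
        # Adaptive fusion: blend focal results back into the ample stream.
        out = self.proj(a).reshape(B, T, -1, H, W)
        f = f.reshape(B, self.keep_frames, -1, H, W)
        g = torch.sigmoid(self.gate)
        fused = g * f + (1 - g) * torch.gather(out, 1, gather_idx.expand_as(f))
        return out.scatter(1, gather_idx.expand_as(f), fused)

# Toy usage: 8-frame clips; the focal branch convolves only 4 salient frames.
block = AFBlock(in_ch=16, out_ch=64)
video = torch.randn(2, 8, 16, 32, 32)  # (batch, frames, channels, H, W)
print(block(video).shape)              # torch.Size([2, 8, 64, 32, 32])
```

The intuition this sketch captures is the one stated in the abstract: the ample stream keeps a (condensed) representation of every frame, so even frames the focal branch skips still contribute after fusion, while the expensive full-width convolution is paid only for the `keep_frames` frames the navigation scores deem salient.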