A primary challenge in few-shot action recognition is the inadequacy of video data for training. To address this issue, current methods in this field mainly focus on devising algorithms at the feature level, while little attention is paid to processing the input video data. Moreover, existing frame sampling strategies may omit critical action information in the temporal and spatial dimensions, which further impairs video utilization efficiency. In this paper, we propose a novel video frame sampler for few-shot action recognition to address this issue, in which task-specific spatial-temporal frame sampling is achieved via a temporal selector (TS) and a spatial amplifier (SA). Specifically, our sampler first scans the whole video at a small computational cost to obtain a global perception of the video frames. The TS then selects the top-T frames that contribute most significantly to recognition, and subsequently the SA emphasizes the discriminative information of each selected frame by amplifying critical regions under the guidance of saliency maps. We further adopt task-adaptive learning to dynamically adjust the sampling strategy according to the episode task at hand. Both the TS and the SA are implemented to be differentiable, enabling end-to-end optimization and seamless integration of the proposed sampler with most few-shot action recognition methods. Extensive experiments show significant performance boosts on various benchmarks, including long-term videos.
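For intuition, below is a minimal PyTorch sketch of the two components. It is not the paper's implementation: a straight-through Gumbel-softmax relaxation stands in for the TS's differentiable top-T selection, and a per-axis saliency-weighted warping grid (in the spirit of attention-based non-uniform samplers) stands in for the SA; all module and function names here are invented for illustration.

```python
# A minimal sketch, assuming PyTorch. Names and design details are
# hypothetical; this illustrates the general idea, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalSelector(nn.Module):
    """Scores all frames from a cheap global scan, then softly keeps top-T."""

    def __init__(self, feat_dim: int, num_select: int, tau: float = 1.0):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)   # lightweight per-frame score
        self.num_select = num_select
        self.tau = tau

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, N, D), global features of all N scanned frames.
        logits = self.scorer(frame_feats).squeeze(-1)           # (B, N)
        # T relaxed one-hot draws (straight-through Gumbel-softmax) as a
        # differentiable surrogate for hard top-T indexing. Simplification:
        # independent draws may occasionally pick the same frame twice.
        picks = [F.gumbel_softmax(logits, tau=self.tau, hard=True)
                 for _ in range(self.num_select)]
        weights = torch.stack(picks, dim=1)                     # (B, T, N)
        return weights @ frame_feats                            # (B, T, D)


def _axis_grid(sal_1d: torch.Tensor, out_size: int, sigma: float = 0.3):
    """1-D warping coordinates pulled toward high-saliency positions."""
    B, L = sal_1d.shape
    src = torch.linspace(-1.0, 1.0, L, device=sal_1d.device)         # inputs
    dst = torch.linspace(-1.0, 1.0, out_size, device=sal_1d.device)  # outputs
    kern = torch.exp(-(dst[:, None] - src[None, :]) ** 2 / (2 * sigma ** 2))
    w = kern[None] * sal_1d[:, None, :]                         # (B, out, L)
    # Each output location samples a saliency-weighted mean of nearby input
    # coordinates, so salient regions receive denser sampling (i.e. zoom).
    return (w * src).sum(-1) / (w.sum(-1) + 1e-6)               # (B, out)


def spatial_amplify(frame: torch.Tensor, saliency: torch.Tensor,
                    out_size: int = 224) -> torch.Tensor:
    """Resample a frame on a non-uniform grid that magnifies salient regions.

    frame: (B, C, H, W); saliency: (B, H, W), non-negative.
    """
    gx = _axis_grid(saliency.mean(dim=1), out_size)   # x-coords, (B, out)
    gy = _axis_grid(saliency.mean(dim=2), out_size)   # y-coords, (B, out)
    grid = torch.stack(torch.broadcast_tensors(
        gx[:, None, :], gy[:, :, None]), dim=-1)      # (B, out, out, 2)
    return F.grid_sample(frame, grid, align_corners=True)


if __name__ == "__main__":
    ts = TemporalSelector(feat_dim=128, num_select=4)
    picked = ts(torch.randn(2, 16, 128))                   # (2, 4, 128)
    zoomed = spatial_amplify(torch.rand(2, 3, 112, 112),
                             torch.rand(2, 112, 112), out_size=112)
    print(picked.shape, zoomed.shape)                      # sanity check
```

The key design property both stand-ins share with the paper's TS and SA is differentiability: gradients from the downstream few-shot classifier flow back into the frame scorer and the saliency map, which is what allows the sampler to be optimized end-to-end together with existing few-shot action recognition methods.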