A primary challenge in few-shot action recognition is the scarcity of video data for training. To address this issue, current methods in this field mainly focus on devising algorithms at the feature level, while little attention is paid to processing the input video data. Moreover, existing frame sampling strategies may omit critical action information in the temporal and spatial dimensions, which further harms video utilization efficiency. In this paper, we propose a novel video frame sampler for few-shot action recognition, in which task-specific spatial-temporal frame sampling is achieved via a temporal selector (TS) and a spatial amplifier (SA). Specifically, our sampler first scans the whole video at a small computational cost to obtain a global perception of the video frames. The TS then selects the top-T frames that contribute most significantly, and the SA emphasizes the discriminative information of each selected frame by amplifying critical regions under the guidance of saliency maps. We further adopt task-adaptive learning to dynamically adjust the sampling strategy according to the episode task at hand. The implementations of both TS and SA are differentiable, enabling end-to-end optimization and seamless integration of the proposed sampler with most few-shot action recognition methods. Extensive experiments show significant performance gains on various benchmarks, including long-term videos. The code is available at https://github.com/R00Kie-Liu/Sampler
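The abstract leaves the TS architecture to the paper body, but a minimal sketch may help illustrate how a differentiable top-T frame selection can be wired up. The sketch below is an assumption on our part, written in PyTorch: the linear frame scorer, the tensor shapes, and the straight-through top-T relaxation are illustrative choices, not necessarily the authors' exact TS design.

```python
# A minimal sketch of a differentiable temporal selector (TS).
# Assumptions (not from the paper): a linear scorer over precomputed
# frame features, and a straight-through estimator for the hard top-T.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalSelector(nn.Module):
    def __init__(self, feat_dim: int, num_select: int):
        super().__init__()
        self.num_select = num_select          # T: frames kept per video
        self.scorer = nn.Linear(feat_dim, 1)  # lightweight frame scorer

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, N, D) features from the cheap global scan of N frames
        scores = self.scorer(frame_feats).squeeze(-1)        # (B, N)
        soft = F.softmax(scores, dim=-1)                     # soft selection weights
        topk = scores.topk(self.num_select, dim=-1).indices  # hard top-T indices

        # Straight-through trick: hard one-hot selection in the forward pass,
        # soft gradients to the scorer in the backward pass.
        hard = torch.zeros_like(soft).scatter(-1, topk, 1.0)
        mask = hard + soft - soft.detach()                   # (B, N)

        weighted = mask.unsqueeze(-1) * frame_feats          # zero out dropped frames
        idx = topk.sort(dim=-1).values                       # restore temporal order
        return torch.gather(
            weighted, 1, idx.unsqueeze(-1).expand(-1, -1, frame_feats.size(-1)))
```

The straight-through relaxation is one common way to keep a discrete top-T choice end-to-end trainable, which matches the abstract's claim that TS is differentiable; the paper may use a different relaxation.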
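In the same hedged spirit, the sketch below illustrates one way a saliency-guided spatial amplifier (SA) could be realized. The separable inverse-CDF warp, in which rows and columns with higher marginal saliency receive more output pixels so that salient regions are enlarged, is our stand-in for the paper's amplification; the saliency network itself is taken as given.

```python
# A minimal sketch of a saliency-guided spatial amplifier (SA).
# Assumption (not from the paper): a separable inverse-CDF warp that
# magnifies image regions with high saliency, applied via grid_sample.
import torch
import torch.nn.functional as F

def amplify(frames: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
    # frames:   (B, C, H, W) selected frames
    # saliency: (B, H, W) non-negative saliency maps (e.g. from a small CNN)
    B, C, H, W = frames.shape
    eps = 1e-6

    def axis_coords(marginal: torch.Tensor) -> torch.Tensor:
        # marginal: (B, L) saliency mass per row or column
        pdf = marginal + eps
        pdf = pdf / pdf.sum(dim=-1, keepdim=True)
        cdf = torch.cumsum(pdf, dim=-1)               # monotone in [0, 1]
        # Invert the CDF on a uniform grid: high-saliency bins occupy a
        # steep CDF segment, so many output pixels land inside them.
        u = torch.linspace(0, 1, marginal.size(-1), device=marginal.device)
        idx = torch.searchsorted(cdf, u.expand_as(cdf).contiguous())
        idx = idx.clamp(max=marginal.size(-1) - 1).float()
        return idx / (marginal.size(-1) - 1) * 2 - 1  # normalize to [-1, 1]

    xs = axis_coords(saliency.sum(dim=1))             # (B, W) column coordinates
    ys = axis_coords(saliency.sum(dim=2))             # (B, H) row coordinates
    grid = torch.stack(
        [xs.unsqueeze(1).expand(B, H, W),             # x varies along width
         ys.unsqueeze(2).expand(B, H, W)],            # y varies along height
        dim=-1)                                       # (B, H, W, 2)
    return F.grid_sample(frames, grid, align_corners=True)
```

Because the warp is built from differentiable operations on the saliency map, gradients can flow back into the saliency predictor, consistent with the abstract's statement that SA supports end-to-end optimization.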