Current state-of-the-art approaches for few-shot action recognition achieve promising performance by conducting frame-level matching on learned visual features. However, they generally suffer from two limitations: i) matching between local frames tends to be inaccurate because nothing guides the features toward long-range temporal perception; ii) explicit motion learning is usually ignored, leading to partial information loss. To address these issues, we develop a Motion-augmented Long-short Contrastive Learning (MoLo) method with two crucial components: a long-short contrastive objective and a motion autodecoder. Specifically, the long-short contrastive objective endows local frame features with long-form temporal awareness by maximizing their agreement with the global token of videos belonging to the same class. The motion autodecoder is a lightweight architecture that reconstructs pixel motions from differential features, explicitly embedding motion dynamics into the network. In this way, MoLo simultaneously learns long-range temporal context and motion cues for comprehensive few-shot matching. To demonstrate its effectiveness, we evaluate MoLo on five standard benchmarks; the results show that MoLo favorably outperforms recent state-of-the-art methods. The source code is available at https://github.com/alibaba-mmai-research/MoLo.
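The two components described above can be illustrated with a minimal sketch: an InfoNCE-style loss that pulls local frame features toward the global token of a same-class video, plus adjacent-frame feature differencing as the "differential features" fed to the motion branch. This is a hypothetical simplification for intuition only — the function names, temperature value, and negative-sampling scheme are assumptions, not the paper's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize features to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def frame_differences(frames):
    # Differential features: differences of temporally adjacent frame features,
    # a simple stand-in for the motion cues the autodecoder reconstructs from.
    return frames[1:] - frames[:-1]

def long_short_contrastive_loss(local_feats, global_token, neg_tokens, tau=0.1):
    # InfoNCE-style agreement between each local frame feature (T, D) and the
    # same-class global token (D,), contrasted against global tokens of other
    # classes (N, D). Hypothetical simplification of the long-short objective.
    local = l2_normalize(local_feats)                 # (T, D)
    pos = l2_normalize(global_token)                  # (D,)
    negs = l2_normalize(neg_tokens)                   # (N, D)
    pos_sim = local @ pos / tau                       # (T,)
    neg_sim = local @ negs.T / tau                    # (T, N)
    logits = np.concatenate([pos_sim[:, None], neg_sim], axis=1)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())
```

As a sanity check, frame features that are already aligned with their class's global token yield a lower loss than random features, which is the gradient signal that injects long-range temporal awareness into local frames.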