Learning to localize actions in long, cluttered, and untrimmed videos is a hard task that has typically been addressed in the literature by assuming the availability of large amounts of annotated training samples for each class -- either in a fully-supervised setting, where action boundaries are known, or in a weakly-supervised setting, where only class labels are known for each video. In this paper, we go a step further and show that it is possible to learn to localize actions in untrimmed videos when a) only one or a few trimmed examples of the target action are available at test time, and b) a large collection of videos with only class-label annotation (some trimmed, and some weakly annotated untrimmed ones) is available for training, with no overlap between the classes used during training and testing. To do so, we propose a network that learns to estimate Temporal Similarity Matrices (TSMs) that model a fine-grained similarity pattern between pairs of videos (trimmed or untrimmed), and uses them to generate Temporal Class Activation Maps (TCAMs) for seen or unseen classes. The TCAMs serve as temporal attention mechanisms to extract video-level representations of untrimmed videos and to temporally localize actions at test time. To the best of our knowledge, we are the first to propose a weakly-supervised, one/few-shot action localization network that can be trained in an end-to-end fashion. Experimental results on the THUMOS14 and ActivityNet1.2 datasets show that our method achieves performance comparable to or better than state-of-the-art fully-supervised few-shot learning methods.
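To make the TSM-to-TCAM idea concrete, below is a minimal PyTorch sketch of one plausible instantiation: frame-level features of a trimmed support example and an untrimmed video are compared by cosine similarity to form a TSM, which is then collapsed into a TCAM that acts as temporal attention. The function names, the max-pooling over the query axis, and the softmax attention here are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of the TSM -> TCAM pipeline described above;
# shapes and pooling choices are assumptions, not the paper's design.
import torch
import torch.nn.functional as F

def temporal_similarity_matrix(query_feats, video_feats):
    """Cosine-similarity TSM between a trimmed query (Tq x D)
    and an untrimmed video (Tv x D)."""
    q = F.normalize(query_feats, dim=-1)   # (Tq, D)
    v = F.normalize(video_feats, dim=-1)   # (Tv, D)
    return q @ v.t()                       # (Tq, Tv): entry (i, j) = sim(query frame i, video frame j)

def tcam_from_tsm(tsm):
    """Collapse the TSM over the query axis into a per-frame activation
    (TCAM) for the untrimmed video; max-pooling is one simple choice."""
    return tsm.max(dim=0).values           # (Tv,): peak similarity per untrimmed frame

def attended_video_representation(video_feats, tcam):
    """Use the TCAM as temporal attention to pool a video-level feature."""
    weights = torch.softmax(tcam, dim=0)   # (Tv,) attention over time
    return weights @ video_feats           # (D,) attention-weighted feature

# Usage: score an untrimmed test video against one trimmed support
# example of an unseen class, then threshold the TCAM to localize.
Tq, Tv, D = 16, 200, 512
query = torch.randn(Tq, D)                 # features of the trimmed support example
video = torch.randn(Tv, D)                 # features of the untrimmed test video
tsm = temporal_similarity_matrix(query, video)
tcam = tcam_from_tsm(tsm)
segments = tcam > tcam.mean()              # crude per-frame action mask over time
```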