Event perception tasks, such as recognizing and localizing actions in streaming videos, are essential for visual understanding. Progress has primarily been driven by large-scale annotated training data used in a supervised manner. In this work, we tackle the problem of learning \textit{actor-centered} representations through continual hierarchical predictive learning to localize actions in streaming videos without any training annotations. Inspired by cognitive theories of event perception, we propose a novel, self-supervised framework driven by hierarchical predictive learning that constructs actor-centered features through attention-based contextualization. Extensive experiments on three benchmark datasets show that the approach learns robust representations for localizing actions with only one epoch of training, i.e., the model is trained continually in a streaming fashion, one frame at a time, with a single pass through the training videos. We show that the proposed approach outperforms unsupervised and weakly supervised baselines while offering competitive performance to fully supervised approaches. Finally, we show that the model generalizes to out-of-domain data for both the recognition and localization tasks without any finetuning and without significant loss in performance.
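To make the training regime concrete, below is a minimal sketch of single-pass, frame-at-a-time predictive training with attention-based contextualization over a short memory of past frames. This is an illustrative assumption, not the paper's actual architecture: the `FramePredictor` module, its dimensions, the bounded memory length, and the MSE prediction loss are all hypothetical stand-ins for the hierarchical predictive stack described above.

```python
# Minimal sketch (assumed PyTorch) of continual, single-pass predictive
# training: one gradient update per frame, data never revisited.
# All names and hyperparameters here are hypothetical illustrations.
import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    """Predicts the next frame's features from the current ones,
    contextualized by attention over a memory of past encodings."""
    def __init__(self, feat_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.encode = nn.Linear(feat_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.predict = nn.Linear(feat_dim, feat_dim)

    def forward(self, feat: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # feat: (1, feat_dim); memory: (1, T, feat_dim) of past encodings.
        q = self.encode(feat).unsqueeze(1)     # (1, 1, feat_dim) query
        ctx, _ = self.attn(q, memory, memory)  # contextualize over the past
        return self.predict(ctx.squeeze(1))    # predicted next-frame features

def train_single_pass(model, frame_features, lr=1e-4, mem_len=16):
    """One continual pass over a video: update after every frame."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    memory = []
    for t in range(len(frame_features) - 1):
        # Append the current encoding to a bounded streaming memory;
        # detach so gradients do not flow into past time steps.
        memory.append(model.encode(frame_features[t]).detach())
        memory = memory[-mem_len:]
        mem = torch.stack(memory, dim=1)       # (1, T, feat_dim)
        pred = model(frame_features[t], mem)
        # Self-supervised signal: prediction error on the next frame.
        loss = nn.functional.mse_loss(pred, frame_features[t + 1])
        opt.zero_grad()
        loss.backward()
        opt.step()
```

In this sketch the prediction error itself is the only learning signal, which mirrors the annotation-free, one-epoch setting: each frame is seen once, used for a single update, and then discarded apart from its bounded attention memory.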