Video understanding tasks have traditionally been modeled by two separate architectures, each specially tailored to its own class of tasks. Sequence-based video tasks, such as action recognition, use a video backbone to directly extract spatiotemporal features, while frame-based video tasks, such as multiple object tracking (MOT), rely on a single fixed image backbone to extract spatial features. In contrast, we propose to unify video understanding tasks into one novel streaming video architecture, referred to as Streaming Vision Transformer (S-ViT). S-ViT first produces frame-level features with a memory-enabled, temporally-aware spatial encoder to serve the frame-based video tasks. The frame features are then fed into a task-related temporal decoder to obtain spatiotemporal features for sequence-based tasks. The efficiency and efficacy of S-ViT are demonstrated by state-of-the-art accuracy on the sequence-based action recognition task and a competitive advantage over conventional architectures on the frame-based MOT task. We believe that the concept of the streaming video model and the implementation of S-ViT are solid steps towards a unified deep learning architecture for video understanding. Code will be available at https://github.com/yuzhms/Streaming-Video-Model.
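To make the two-stage design concrete, the following is a minimal PyTorch sketch of a streaming setup in the spirit described above: a spatial encoder that processes one frame at a time while attending to a memory of past-frame tokens (serving frame-based tasks), followed by a temporal decoder that aggregates the per-frame features for a sequence-level prediction. This is not the paper's implementation; all module names, dimensions, and the memory mechanism are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MemorySpatialEncoder(nn.Module):
    """Toy stand-in for a memory-enabled, temporally-aware spatial encoder.

    Each incoming frame is embedded into tokens; tokens cached from recent
    frames are appended as extra keys/values so the current frame can attend
    to the past in a streaming fashion.
    """

    def __init__(self, dim=256, num_heads=8, memory_size=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.memory_size = memory_size
        self.memory = []  # cached token features of recent frames

    def forward(self, frame):
        # frame: (B, 3, H, W) -> tokens: (B, N, dim)
        tokens = self.patch_embed(frame).flatten(2).transpose(1, 2)
        # Keys/values include current tokens plus cached past-frame tokens.
        context = torch.cat([tokens] + self.memory, dim=1) if self.memory else tokens
        out, _ = self.attn(self.norm(tokens), context, context)
        tokens = tokens + out
        # Update the memory with the (detached) current-frame features.
        self.memory.append(tokens.detach())
        self.memory = self.memory[-self.memory_size:]
        return tokens  # frame-level features for frame-based tasks (e.g. MOT)


class TemporalDecoder(nn.Module):
    """Toy task-related temporal decoder: aggregates per-frame features over time."""

    def __init__(self, dim=256, num_classes=400):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):
        # frame_feats: (B, T, dim) -- one pooled vector per frame.
        x = self.temporal(frame_feats)
        return self.head(x.mean(dim=1))  # clip-level prediction (e.g. action class)


if __name__ == "__main__":
    encoder, decoder = MemorySpatialEncoder(), TemporalDecoder()
    clip = torch.randn(2, 8, 3, 224, 224)  # batch of 8-frame clips
    per_frame = [encoder(clip[:, t]).mean(dim=1) for t in range(clip.shape[1])]
    logits = decoder(torch.stack(per_frame, dim=1))
    print(logits.shape)  # torch.Size([2, 400])
```

In this sketch the frame-level outputs of the encoder could be consumed directly by a frame-based head (e.g. a tracker), while the temporal decoder consumes the same features for a sequence-level task, which is the unification the abstract describes.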