Most existing video action recognition models ingest raw RGB frames. However, the raw video stream requires enormous storage and contains significant temporal redundancy. Video compression (e.g., H.264, MPEG-4) reduces this superfluous information by representing the raw video stream as Groups of Pictures (GOPs). Each GOP consists of an initial I-frame (i.e., an RGB image) followed by a number of P-frames, represented by motion vectors and residuals, which can be regarded and used as pre-extracted features. In this work, we 1) introduce GOP-level sampling of network inputs from partially decoded videos, and 2) propose a plug-and-play mulTi-modal lEArning Module (TEAM) that trains the network on information from both I-frames and P-frames in an end-to-end manner. We demonstrate the superior performance of TEAM-Net over a baseline using RGB input only. TEAM-Net also achieves state-of-the-art performance in video action recognition with partial decoding. Code is provided at https://github.com/villawang/TEAM-Net.
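The GOP structure and GOP-level sampling described above can be illustrated with a toy sketch. This is not the paper's actual pipeline: the `GOP`/`PFrame` classes and `sample_gops` helper below are hypothetical stand-ins for what a codec-side partial decoder would return (one decoded I-frame per GOP, plus motion vectors and residuals for its P-frames, with no full RGB decoding of the remaining frames).

```python
from dataclasses import dataclass, field
import random

@dataclass
class PFrame:
    # In a real codec these are parsed from the compressed bitstream,
    # not computed from decoded pixels.
    motion_vectors: list  # e.g., per-macroblock (dx, dy) displacements
    residual: list        # difference signal left after motion compensation

@dataclass
class GOP:
    i_frame: list                                  # fully decoded RGB keyframe
    p_frames: list = field(default_factory=list)   # pre-extracted P-frame features

def sample_gops(video, num_gops, seed=0):
    """GOP-level sampling: draw num_gops GOPs uniformly at random.
    Each sample supplies both modalities (I-frame RGB + P-frame motion
    vectors/residuals) that a multi-modal network can consume."""
    rng = random.Random(seed)
    return rng.sample(video, min(num_gops, len(video)))

# Toy "video" of 8 GOPs, each with one I-frame and 11 P-frames.
video = [GOP(i_frame=[g], p_frames=[PFrame([(0, 0)], [0]) for _ in range(11)])
         for g in range(8)]
sampled = sample_gops(video, num_gops=3)
print(len(sampled))  # 3 GOPs feed the network instead of 8 * 12 decoded frames
```

The point of the sketch is the input economy: per sampled GOP, only one frame is decoded to RGB, while motion vectors and residuals come essentially for free from the compressed stream.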