Recent video classification research has focused on two directions: temporal modeling and efficient 3D architectures. However, temporal modeling methods tend to be inefficient, while efficient 3D architectures pay less attention to temporal modeling. To bridge this gap, we propose VoV3D, an efficient 3D architecture for temporal modeling that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D. T-OSA builds a feature hierarchy by aggregating temporal features with different temporal receptive fields. Stacking T-OSA modules enables the network itself to model both short-range and long-range temporal relationships across frames without any external modules. Inspired by kernel factorization and channel factorization, we also design a depthwise spatiotemporal factorization module, D(2+1)D, which decomposes a 3D depthwise convolution into a spatial depthwise convolution and a temporal depthwise convolution, making the network more lightweight and efficient. Using the proposed temporal modeling method (T-OSA) and the efficient factorized component (D(2+1)D), we construct two VoV3D networks, VoV3D-M and VoV3D-L. Thanks to its efficient and effective temporal modeling, VoV3D-L surpasses a state-of-the-art temporal modeling method on both Something-Something and Kinetics-400 with 6x fewer model parameters and 16x less computation. Furthermore, VoV3D shows better temporal modeling ability than X3D, a state-of-the-art efficient 3D architecture with comparable model capacity. We hope that VoV3D can serve as a baseline for efficient video classification.
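The two ideas above can be illustrated with a little arithmetic. The sketch below is not the authors' implementation. It only counts the weights in a full 3D depthwise filter versus a spatial-plus-temporal depthwise pair, and shows how the temporal receptive field grows when temporal convolutions are stacked, which is what gives an aggregation module features with different temporal extents. The kernel size of 3, the channel count of 64, stride 1, no dilation, no bias terms, and all function names are assumptions chosen for illustration.

```python
from typing import List

# Illustrative arithmetic only. Kernel sizes, channel counts and function
# names are assumptions, not values taken from the VoV3D paper.


def depthwise_3d_weights(channels: int, k_t: int = 3, k_s: int = 3) -> int:
    """Weights in a full 3D depthwise convolution:
    one k_t x k_s x k_s filter per channel (bias ignored)."""
    return channels * k_t * k_s * k_s


def factorized_depthwise_weights(channels: int, k_t: int = 3, k_s: int = 3) -> int:
    """Weights when the filter is split into a 1 x k_s x k_s spatial depthwise
    convolution and a k_t x 1 x 1 temporal depthwise convolution (bias ignored)."""
    spatial = channels * k_s * k_s
    temporal = channels * k_t
    return spatial + temporal


def stacked_temporal_receptive_fields(num_layers: int, k_t: int = 3) -> List[int]:
    """Temporal receptive field (in frames) after each of num_layers stacked
    temporal convolutions, assuming stride 1 and no dilation."""
    return [1 + i * (k_t - 1) for i in range(1, num_layers + 1)]


if __name__ == "__main__":
    c = 64  # hypothetical channel count
    print("full 3D depthwise weights:", depthwise_3d_weights(c))
    print("factorized depthwise weights:", factorized_depthwise_weights(c))
    print("receptive fields per layer:", stacked_temporal_receptive_fields(3))
```

Under these assumptions the factorized pair needs 9 + 3 = 12 weights per channel instead of 27, and three stacked temporal layers see 3, 5, and 7 frames respectively, so concatenating their outputs mixes short-range and longer-range temporal context.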