Recently, MLP-like networks have been revived for image recognition. However, whether a generic MLP-like architecture can be built for the video domain has not been explored, due to the complexity of spatial-temporal modeling and its heavy computational burden. To fill this gap, we present an efficient self-attention-free backbone, namely MorphMLP, which flexibly leverages concise Fully-Connected (FC) layers for video representation learning. Specifically, a MorphMLP block consists of two key layers in sequence, i.e., MorphFC_s and MorphFC_t, for spatial and temporal modeling respectively. MorphFC_s effectively captures the core semantics in each frame through progressive token interaction along both the height and width dimensions. Meanwhile, MorphFC_t adaptively learns long-term dependencies across frames through temporal token aggregation at each spatial location. With such multi-dimensional and multi-scale factorization, the MorphMLP block achieves a strong accuracy-computation balance. Finally, we evaluate MorphMLP on a number of popular video benchmarks. Compared with recent state-of-the-art models, MorphMLP significantly reduces computation while delivering better accuracy: e.g., MorphMLP-S uses only 50% of the GFLOPs of VideoSwin-T yet achieves a 0.9% top-1 improvement on Kinetics400, under ImageNet1K pretraining; MorphMLP-B uses only 43% of the GFLOPs of MViT-B yet achieves a 2.4% top-1 improvement on SSV2, even though MorphMLP-B is pretrained on ImageNet1K while MViT-B is pretrained on Kinetics400. Moreover, when adapted to the image domain, our method outperforms previous state-of-the-art MLP-like architectures. Code is available at https://github.com/MTLab/MorphMLP.
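To make the factorized spatial-temporal design concrete, below is a minimal PyTorch sketch of one MorphMLP-style block. It is a simplified illustration, not the official implementation: it mixes tokens with plain FC layers along the height, width, and time axes with residual connections, and omits the paper's progressive chunk-wise "morphing" schedule and channel grouping. The class and parameter names (MorphFCBlock, fc_h, fc_w, fc_t) are hypothetical, chosen only for readability.

```python
import torch
import torch.nn as nn

class MorphFCBlock(nn.Module):
    """Simplified sketch of a MorphMLP-style block (assumed form, not official code).

    Input/output shape: (B, T, H, W, C) video features.
    - The spatial part (analogous to MorphFC_s) mixes tokens along H, then W,
      within each frame.
    - The temporal part (analogous to MorphFC_t) mixes tokens along T at each
      spatial location.
    """
    def __init__(self, dim, T, H, W):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.fc_h = nn.Linear(H, H)   # FC token mixing along the height axis
        self.fc_w = nn.Linear(W, W)   # FC token mixing along the width axis
        self.norm_t = nn.LayerNorm(dim)
        self.fc_t = nn.Linear(T, T)   # FC token mixing along the temporal axis

    def forward(self, x):  # x: (B, T, H, W, C)
        # Spatial token interaction per frame (residual).
        y = self.norm_s(x)
        y = self.fc_h(y.transpose(2, 4)).transpose(2, 4)  # mix along H
        y = self.fc_w(y.transpose(3, 4)).transpose(3, 4)  # mix along W
        x = x + y
        # Temporal token aggregation at each spatial location (residual).
        y = self.norm_t(x)
        y = self.fc_t(y.transpose(1, 4)).transpose(1, 4)  # mix along T
        return x + y

if __name__ == "__main__":
    block = MorphFCBlock(dim=96, T=8, H=14, W=14)
    x = torch.randn(2, 8, 14, 14, 96)
    print(block(x).shape)  # torch.Size([2, 8, 14, 14, 96])
```

Because every mixing operation is a linear layer over a single axis, the cost grows linearly in each of T, H, and W rather than quadratically in the token count, which is the source of the accuracy-computation balance the abstract describes.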