Convolutional Neural Networks (CNNs) have long been regarded as the go-to models for visual recognition. More recently, convolution-free networks, based on multi-head self-attention (MSA) or multi-layer perceptrons (MLPs), have become increasingly popular. Nevertheless, utilizing these newly-minted networks for video recognition is non-trivial due to the large variations and complexities in video data. In this paper, we present MLP-3D networks, a novel MLP-like 3D architecture for video recognition. Specifically, the architecture consists of MLP-3D blocks, where each block contains one MLP applied across tokens (i.e., token-mixing MLP) and one MLP applied independently to each token (i.e., channel MLP). By deriving novel grouped time mixing (GTM) operations, we equip the basic token-mixing MLP with the ability of temporal modeling. GTM divides the input tokens into several temporal groups and linearly maps the tokens in each group with a shared projection matrix. Furthermore, we devise several variants of GTM with different grouping strategies, and compose each variant into different blocks of the MLP-3D network by greedy architecture search. Without depending on convolutions or attention mechanisms, our MLP-3D networks achieve 68.5\%/81.4\% top-1 accuracy on the Something-Something V2 and Kinetics-400 datasets, respectively. Despite requiring fewer computations, the results are comparable to those of state-of-the-art, widely-used 3D CNNs and video transformers. Source code is available at https://github.com/ZhaofanQiu/MLP-3D.
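To make the grouped time mixing (GTM) idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a token tensor of shape (batch, frames, spatial tokens, channels) and illustrates splitting frames into temporal groups and mixing the tokens within each group by one shared linear projection; the class name `GroupedTimeMixing` and its parameters `frames_per_group` and `tokens_per_frame` are hypothetical.

```python
import torch
import torch.nn as nn


class GroupedTimeMixing(nn.Module):
    """Hypothetical sketch of a grouped time-mixing (GTM) layer.

    Assumes the input is a token tensor of shape (B, T, N, C):
    batch, frames, spatial tokens per frame, channels. The T frames
    are split into temporal groups of `frames_per_group` frames, and
    the tokens inside each group are mixed by a single linear
    projection whose weights are shared across all groups.
    """

    def __init__(self, frames_per_group: int, tokens_per_frame: int):
        super().__init__()
        self.frames_per_group = frames_per_group
        # One projection over all tokens in a temporal group, shared by every group.
        mixed = frames_per_group * tokens_per_frame
        self.mix = nn.Linear(mixed, mixed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, c = x.shape
        g = t // self.frames_per_group  # number of temporal groups
        # Flatten each temporal group into one token axis: (B, G, frames_per_group * N, C).
        x = x.reshape(b, g, self.frames_per_group * n, c)
        # Mix along the grouped token axis (shared weights for every group).
        x = self.mix(x.transpose(2, 3)).transpose(2, 3)
        return x.reshape(b, t, n, c)


if __name__ == "__main__":
    layer = GroupedTimeMixing(frames_per_group=2, tokens_per_frame=49)
    video_tokens = torch.randn(1, 8, 49, 96)   # 8 frames, 7x7 tokens, 96 channels
    print(layer(video_tokens).shape)           # torch.Size([1, 8, 49, 96])
```

Different choices of `frames_per_group` correspond to the different grouping strategies mentioned in the abstract; the released code at the repository above contains the authors' actual operators.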