Group Activity Recognition (GAR) aims to predict the activity category of a group by learning the spatial-temporal interaction relations among its actors. An effective actor relation learning method is therefore crucial for GAR. Previous works mainly learn these relations with carefully designed GCNs or Transformers: to infer actor interactions, GCNs require a learnable adjacency matrix, and Transformers must compute self-attention. Although these methods model the interaction relations effectively, they also increase model complexity (the number of parameters and the amount of computation). In this paper, we design a novel MLP-based method for Actor Interaction Relation learning (MLP-AIR) in GAR. Compared with GCNs and Transformers, our method is a competitive yet conceptually and technically simple alternative that significantly reduces complexity. Specifically, MLP-AIR comprises three sub-modules: an MLP-based Spatial relation modeling module (MLP-S), an MLP-based Temporal relation modeling module (MLP-T), and an MLP-based Relation refining module (MLP-R). MLP-S models the spatial relations between different actors within each frame. MLP-T models the temporal relations between different frames for each actor. MLP-R further refines the relations between different dimensions of the relation features to improve their expressive power. To evaluate MLP-AIR, we conduct extensive experiments on two widely used benchmarks, the Volleyball and Collective Activity datasets. Experimental results demonstrate that MLP-AIR achieves competitive results with low complexity.
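The division of labor among the three sub-modules can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify layer sizes, activations, normalization, or the order of the modules. The sketch assumes actor features laid out as a T x N x D array (frames, actors, feature dimensions), a two-layer ReLU MLP applied along one axis at a time in the style of token-mixing MLPs, residual connections, and the order MLP-S, MLP-T, MLP-R. All function and parameter names (`mix_axis`, `make_params`, `mlp_air_block`) are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import random


def linear(vec, weight):
    """Multiply a length-n vector by an n x m weight matrix (list of rows)."""
    n_out = len(weight[0])
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(n_out)]


def mlp(vec, w1, w2):
    """Two-layer MLP with ReLU; the output length equals the input length."""
    hidden = [max(0.0, h) for h in linear(vec, w1)]
    return linear(hidden, w2)


def mix_axis(x, axis, w1, w2):
    """Apply the MLP along one axis of a T x N x D nested list, plus a residual.

    axis="actor"   mixes actors within each frame      (assumed role of MLP-S)
    axis="time"    mixes frames for each actor         (assumed role of MLP-T)
    axis="feature" mixes feature dims of each token    (assumed role of MLP-R)
    """
    T, N, D = len(x), len(x[0]), len(x[0][0])
    out = [[list(x[t][n]) for n in range(N)] for t in range(T)]
    if axis == "actor":
        for t in range(T):
            for d in range(D):
                mixed = mlp([x[t][n][d] for n in range(N)], w1, w2)
                for n in range(N):
                    out[t][n][d] += mixed[n]
    elif axis == "time":
        for n in range(N):
            for d in range(D):
                mixed = mlp([x[t][n][d] for t in range(T)], w1, w2)
                for t in range(T):
                    out[t][n][d] += mixed[t]
    elif axis == "feature":
        for t in range(T):
            for n in range(N):
                mixed = mlp(x[t][n], w1, w2)
                for d in range(D):
                    out[t][n][d] += mixed[d]
    else:
        raise ValueError(f"unknown axis: {axis}")
    return out


def make_params(T, N, D, hidden, seed=0):
    """Random weights for the three axis-wise MLPs (hypothetical initialization)."""
    rng = random.Random(seed)

    def init(n_in, n_out):
        return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]

    sizes = {"actor": N, "time": T, "feature": D}
    return {name: (init(n, hidden), init(hidden, n)) for name, n in sizes.items()}


def mlp_air_block(x, params):
    """Assumed order: spatial mixing, then temporal mixing, then refinement."""
    x = mix_axis(x, "actor", *params["actor"])
    x = mix_axis(x, "time", *params["time"])
    return mix_axis(x, "feature", *params["feature"])


if __name__ == "__main__":
    T, N, D = 2, 3, 4  # toy sizes: 2 frames, 3 actors, 4 feature dimensions
    feats = [[[0.1 * (t + n + d + 1) for d in range(D)] for n in range(N)] for t in range(T)]
    out = mlp_air_block(feats, make_params(T, N, D, hidden=5))
    print(len(out), len(out[0]), len(out[0][0]))
```

In this sketch each mixing step only needs two small weight matrices sized by the axis it mixes, so no adjacency matrix is learned and no pairwise attention scores are computed. Whether the actual MLP-AIR modules share this exact structure is not stated in the abstract.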