Previous group activity recognition approaches were limited to reasoning using human relations or finding important subgroups and tended to ignore indispensable group composition and human-object interactions. This absence makes a partial interpretation of the scene and increases the interference of irrelevant actions on the results. Therefore, we propose our DynamicFormer with Dynamic composition Module (DcM) and Dynamic interaction Module (DiM) to model relations and locations of persons and discriminate the contribution of participants, respectively. Our findings on group composition and human-object interaction inspire our core idea. Group composition tells us the location of people and their relations inside the group, while interaction reflects the relation between humans and objects outside the group. We utilize spatial and temporal encoders in DcM to model our dynamic composition and build DiM to explore interaction with a novel GCN, which has a transformer inside to consider the temporal neighbors of human/object. Also, a Multi-level Dynamic Integration is employed to integrate features from different levels. We conduct extensive experiments on two public datasets and show that our method achieves state-of-the-art.
翻译:暂无翻译