This work addresses elderly activity recognition, a challenging task because elderly activities involve both individual actions and human-object interactions. We therefore aim to aggregate the discriminative information of actions and interactions from both RGB videos and skeleton sequences by attentively fusing multi-modal features. Recently, several nonlinear multi-modal fusion approaches have been proposed that use a nonlinear attention mechanism extended from Squeeze-and-Excitation Networks (SENet). Inspired by this, we propose a novel Expansion-Squeeze-Excitation Fusion Network (ESE-FN) for elderly activity recognition, which learns modal-wise and channel-wise Expansion-Squeeze-Excitation (ESE) attentions to attentively fuse the multi-modal features along the modal and channel dimensions. Furthermore, we design a new Multi-modal Loss (ML) that keeps the single-modal features and the fused multi-modal features consistent by adding a penalty on the difference between the minimum prediction loss over the single modalities and the prediction loss on the fused modality. Finally, we conduct experiments on the largest-scale elderly activity dataset, ETRI-Activity3D (110,000+ videos and 50+ categories), and show that the proposed ESE-FN achieves the best accuracy compared with state-of-the-art methods. More extensive experimental results show that ESE-FN is also comparable to other methods on normal action recognition tasks.
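The Multi-modal Loss described above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the form used here is an assumption: the fused-modality cross-entropy plus a weighted absolute difference between it and the smallest single-modality cross-entropy. The function names, the `weight` parameter, and the use of an absolute difference are hypothetical and not taken from the paper.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw logits (log-sum-exp for stability)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def multimodal_loss(fused_logits, single_modal_logits, label, weight=1.0):
    """Assumed form of the Multi-modal Loss (not the paper's exact definition).

    fused_logits: class scores predicted from the fused features.
    single_modal_logits: list of class scores, one per modality
        (for example RGB and skeleton).
    weight: hypothetical trade-off factor for the consistency penalty.
    """
    fused = cross_entropy(fused_logits, label)
    # Minimum prediction loss over the single modalities.
    best_single = min(cross_entropy(l, label) for l in single_modal_logits)
    # Penalty on the gap between the fused loss and the best single-modal loss.
    penalty = abs(fused - best_single)
    return fused + weight * penalty
```

Under this assumed form, the penalty vanishes when the fused prediction is exactly as good as the best single modality, and grows as the fused loss drifts away from it in either direction.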