To properly assist humans in their needs, human activity recognition (HAR) systems need the ability to fuse information from multiple modalities. Our hypothesis is that multimodal sensors, visual and non-visual, tend to provide complementary information, each addressing the limitations of the other modalities. In this work, we propose a multimodal framework that learns to effectively combine features from RGB video and IMU sensors, and show its robustness on the MMAct and UTD-MHAD datasets. Our model is trained in two stages: in the first stage, each input encoder learns to extract features effectively, and in the second stage, the model learns to combine these individual features. We show significant improvements of 22% and 11% over the video-only and IMU-only setups on the UTD-MHAD dataset, and of 20% and 12% on the MMAct dataset. Through extensive experimentation, we show the robustness of our model in the zero-shot setting and in the limited-annotation setting. We further compare against state-of-the-art methods that use more input modalities, and show that our method outperforms them significantly on the more difficult MMAct dataset and performs comparably on the UTD-MHAD dataset.
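The two-stage recipe described above can be illustrated with a minimal PyTorch sketch. All module names, feature dimensions, and the choice to freeze the stage-1 encoders during stage 2 are illustrative assumptions, not the authors' exact architecture or training configuration.

```python
# Minimal two-stage multimodal HAR sketch (hypothetical names and sizes).
import torch
import torch.nn as nn

NUM_CLASSES = 27              # e.g. UTD-MHAD defines 27 action classes
RGB_FEAT, IMU_FEAT = 512, 128  # assumed per-modality feature dimensions


class RGBEncoder(nn.Module):
    """Stand-in for a video backbone producing a clip-level feature."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(RGB_FEAT), nn.ReLU())

    def forward(self, x):      # x: (B, C, T, H, W) video clip
        return self.net(x)


class IMUEncoder(nn.Module):
    """Stand-in for an inertial backbone (e.g. 1D convs or a GRU)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(IMU_FEAT), nn.ReLU())

    def forward(self, x):      # x: (B, T, channels) IMU window
        return self.net(x)


def train_stage1(encoder, head, loader, epochs=10):
    """Stage 1: train each encoder with its own classification head."""
    params = list(encoder.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(head(encoder(x)), y).backward()
            opt.step()


class FusionClassifier(nn.Module):
    """Stage 2: learn to combine the per-modality features.

    Freezing the stage-1 encoders here is an assumption for the sketch.
    """
    def __init__(self, rgb_enc, imu_enc):
        super().__init__()
        self.rgb_enc, self.imu_enc = rgb_enc, imu_enc
        for p in list(rgb_enc.parameters()) + list(imu_enc.parameters()):
            p.requires_grad_(False)
        self.fusion = nn.Sequential(
            nn.Linear(RGB_FEAT + IMU_FEAT, 256), nn.ReLU(),
            nn.Linear(256, NUM_CLASSES),
        )

    def forward(self, rgb, imu):
        feats = torch.cat([self.rgb_enc(rgb), self.imu_enc(imu)], dim=-1)
        return self.fusion(feats)
```

In this sketch, `train_stage1` would be called once per modality (video and IMU) with paired classification heads, after which only the fusion layers of `FusionClassifier` are optimized on synchronized video/IMU pairs.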