Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos

Video action segmentation and recognition are widely applied in many fields. Most previous studies rely on large, computationally expensive visual models to understand videos comprehensively, and few apply a graph model directly to reason about video. Graph models offer fewer parameters, low computational cost, a large receptive field, and flexible neighborhood message aggregation. In this paper, we present a graph-based method, Semantic2Graph, that turns video action segmentation and recognition into a node classification problem on graphs. To preserve fine-grained relations in videos, we construct the graph at the frame level and design three types of edges: temporal, semantic, and self-loop. We combine visual, structural, and semantic features as node attributes. Semantic edges model long-term spatio-temporal relations, while the semantic features are embeddings of the label text obtained from a textual prompt. A graph neural network (GNN) model learns the multi-modal feature fusion. Experimental results show that Semantic2Graph improves on state-of-the-art results on GTEA and 50Salads. Multiple ablation experiments further confirm that semantic features improve model performance and that semantic edges enable Semantic2Graph to capture long-term dependencies at low cost.
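The frame-level graph described in the abstract can be illustrated with a minimal sketch. The abstract names the three edge types but does not say how each edge is chosen, so the rules below are assumptions made only for illustration: temporal edges join consecutive frames, and semantic edges join non-adjacent frames that share an action label. The function name `build_frame_graph` and the parameter `max_semantic_per_node` are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def build_frame_graph(frame_labels, max_semantic_per_node=2):
    """Build a typed edge list over frame-level nodes.

    Nodes are frame indices 0..n-1. The edge rules are illustrative
    assumptions, not the paper's exact construction.
    """
    n = len(frame_labels)
    edges = []  # each entry is (source, target, edge_type)

    # Self-loop edges: every frame is connected to itself.
    for i in range(n):
        edges.append((i, i, "self"))

    # Temporal edges: consecutive frames, in both directions.
    for i in range(n - 1):
        edges.append((i, i + 1, "temporal"))
        edges.append((i + 1, i, "temporal"))

    # Semantic edges (assumed rule): link each frame to the next few
    # non-adjacent frames that carry the same action label.
    by_label = defaultdict(list)
    for i, label in enumerate(frame_labels):
        by_label[label].append(i)
    for indices in by_label.values():
        for pos, i in enumerate(indices):
            added = 0
            for j in indices[pos + 1:]:
                if j - i > 1:  # adjacent frames already share a temporal edge
                    edges.append((i, j, "semantic"))
                    edges.append((j, i, "semantic"))
                    added += 1
                    if added == max_semantic_per_node:
                        break
    return edges


if __name__ == "__main__":
    labels = ["take", "take", "pour", "pour", "take"]
    for source, target, edge_type in build_frame_graph(labels):
        print(source, target, edge_type)
```

In this toy input, frame 4 returns to the action of frames 0 and 1, so the assumed semantic rule links it to both of them in a single hop, while temporal edges alone would need several hops. A GNN would then classify each node using features attached to it, which this sketch omits.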

