Temporal relational modeling in video is essential for human action understanding, such as action recognition and action segmentation. Although Graph Convolution Networks (GCNs) have shown promising advantages in relation reasoning on many tasks, effectively applying graph convolution networks to long video sequences remains a challenge. The main reason is that the large number of nodes (i.e., video frames) makes it hard for GCNs to capture and model temporal relations in videos. To tackle this problem, in this paper, we introduce an effective GCN module, the Dilated Temporal Graph Reasoning Module (DTGRM), designed to model temporal relations and dependencies between video frames at various time spans. In particular, we capture and model temporal relations by constructing multi-level dilated temporal graphs in which the nodes represent frames from different moments in the video. Moreover, to enhance the temporal reasoning ability of the proposed model, an auxiliary self-supervised task is proposed to encourage the dilated temporal graph reasoning module to find and correct wrong temporal relations in videos. Our DTGRM model outperforms state-of-the-art action segmentation models on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset. The code is available at https://github.com/redwang/DTGRM.
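To make the idea of multi-level dilated temporal graphs concrete, below is a minimal sketch of how such graphs could be built over per-frame features and used in a single round of graph reasoning. The edge set (each frame connected to frames one dilation step away), the normalization, and the function names (`dilated_temporal_adjacency`, `dilated_graph_reasoning`) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def dilated_temporal_adjacency(num_frames, dilation):
    """Adjacency linking each frame t to frames t - dilation and t + dilation.

    Hypothetical helper: the exact edge set and normalization are assumptions
    made for illustration, not the released DTGRM code.
    """
    A = np.eye(num_frames)  # self-loops
    for t in range(num_frames):
        for nb in (t - dilation, t + dilation):
            if 0 <= nb < num_frames:
                A[t, nb] = 1.0
    # Symmetric degree normalization: D^{-1/2} A D^{-1/2}
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A @ D_inv_sqrt

def dilated_graph_reasoning(frame_feats, dilations, rng=None):
    """One round of graph reasoning over multi-level dilated temporal graphs.

    frame_feats: (T, C) array of per-frame features.
    Each dilation level gets its own (randomly initialized) weight matrix;
    the per-level outputs are summed to mimic multi-level aggregation.
    """
    rng = rng or np.random.default_rng(0)
    T, C = frame_feats.shape
    out = np.zeros_like(frame_feats)
    for dil in dilations:
        A = dilated_temporal_adjacency(T, dil)
        W = rng.standard_normal((C, C)) * 0.01   # illustrative weights
        out += np.maximum(A @ frame_feats @ W, 0.0)  # GCN layer with ReLU
    return out

# Usage: 100 frames with 64-dim features, reasoning over dilations 1, 2, 4.
feats = np.random.default_rng(1).standard_normal((100, 64))
refined = dilated_graph_reasoning(feats, dilations=[1, 2, 4])
print(refined.shape)  # (100, 64)
```

Larger dilations connect frames that are farther apart in time, so stacking several levels lets the module relate moments at short, medium, and long time spans without densely connecting every pair of frames.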