Long-term complex activity recognition and localisation can be crucial for decision making in autonomous systems such as smart cars and surgical robots. Here we address the problem via a novel deformable, spatiotemporal scene graph approach, consisting of three main building blocks: (i) action tube detection, (ii) the modelling of the deformable geometry of parts, and (iii) a graph convolutional network. First, action tubes are detected in a series of snippets. Next, a new 3D deformable RoI pooling layer is designed for learning the flexible, deformable geometry of the constituent action tubes. Finally, a scene graph is constructed by treating all parts as nodes and connecting them according to different semantics, such as order of appearance, shared action label, and feature similarity. We also contribute new temporal complex-activity annotations for the recently released ROAD autonomous driving and SARAS-ESAD surgical action datasets, and show the adaptability of our framework to different domains. Our method significantly outperforms graph-based competitors on both augmented datasets.
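The scene-graph construction step described above can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's implementation): each detected action tube becomes a node, and edges are added under the three semantics the abstract names, i.e. order of appearance, shared action label, and feature similarity. The `sim_thresh` parameter and the cosine-similarity choice are assumptions for the sketch.

```python
import numpy as np

def build_scene_graph(features, labels, times, sim_thresh=0.5):
    """Sketch of scene-graph construction over action-tube nodes.

    Edges are added when tubes (i) are consecutive in order of
    appearance, (ii) share the same action label, or (iii) have
    cosine feature similarity above `sim_thresh` (an assumed rule).
    Returns a symmetric adjacency matrix with a zero diagonal.
    """
    n = len(labels)
    A = np.zeros((n, n))

    # (i) chain consecutive tubes by order of appearance
    order = np.argsort(times)
    for a, b in zip(order[:-1], order[1:]):
        A[a, b] = A[b, a] = 1.0

    # cosine similarity between L2-normalised tube features
    F = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = F @ F.T

    for i in range(n):
        for j in range(i + 1, n):
            # (ii) same action label
            if labels[i] == labels[j]:
                A[i, j] = A[j, i] = 1.0
            # (iii) feature similarity above threshold
            if sim[i, j] > sim_thresh:
                A[i, j] = A[j, i] = 1.0
    return A
```

The resulting adjacency matrix would then be fed, together with the node features, to a graph convolutional network for activity classification.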