We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization that receives a video as input and directly predicts the set of action instances appearing in it. Detecting and localizing action instances in untrimmed videos requires reasoning jointly over the multiple instances that occur in a video. The dominant paradigms in the literature process videos sequentially in time, either to propose action regions or to directly produce frame-level detections. However, sequential processing is problematic when action instances have non-sequential dependencies and/or non-linear temporal ordering, such as overlapping instances or instances that re-occur over the course of the video. In this work, we capture this non-linear temporal structure by reasoning over videos as non-sequential entities in the form of graphs. We evaluate our model on three challenging datasets: THUMOS14, Charades, and EPIC-Kitchens-100. Our results show that the proposed model outperforms the state-of-the-art by a considerable margin.
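To make the set-prediction interface concrete, the following is a minimal sketch, not the paper's actual Activity Graph Transformer: it assumes snippet-level features are precomputed, and it uses standard multi-head self-attention over snippet and query nodes as a stand-in for the graph reasoning step. All module names, layer sizes, and the (center, width) segment parameterization are illustrative assumptions.

```python
# Illustrative sketch of direct set prediction for temporal action localization.
# NOT the authors' AGT architecture: layer choices and heads are assumptions.
import torch
import torch.nn as nn


class SetActionLocalizer(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=256, num_queries=100, num_classes=20):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, hidden_dim)      # project snippet features
        self.queries = nn.Embedding(num_queries, hidden_dim)   # one node per candidate instance
        # Full self-attention over (snippet + query) nodes acts as a densely
        # connected graph, so dependencies need not follow temporal order.
        self.graph_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
            for _ in range(4)
        )
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no action"
        self.segment_head = nn.Linear(hidden_dim, 2)               # normalized (center, width)

    def forward(self, snippet_feats):
        # snippet_feats: (batch, num_snippets, feat_dim)
        b = snippet_feats.size(0)
        nodes = torch.cat(
            [self.input_proj(snippet_feats),
             self.queries.weight.unsqueeze(0).expand(b, -1, -1)], dim=1)
        for layer in self.graph_layers:
            nodes = layer(nodes)
        q = nodes[:, -self.queries.num_embeddings:]                # query-node outputs
        return {"classes": self.class_head(q),                     # (b, num_queries, C+1)
                "segments": self.segment_head(q).sigmoid()}        # (b, num_queries, 2)


# Usage: each query directly yields one candidate action instance.
feats = torch.randn(2, 128, 2048)
out = SetActionLocalizer()(feats)
print(out["classes"].shape, out["segments"].shape)
```

In this style of model, a set-based matching loss (as in detection transformers) would pair predicted instances with ground-truth instances during training; that part is omitted here for brevity.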