Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos. These datasets pose many real-world challenges, such as composite actions, co-occurring actions, and high temporal variation in instance duration. To handle these challenges, we propose to explore both the class and temporal relations of detected actions. In this work, we introduce an end-to-end network, the Class-Temporal Relational Network (CTRN), which contains three key components: (1) the Representation Transform Module filters class-specific features from the mixed representations to build graph-structured data; (2) the Class-Temporal Module models class and temporal relations in a sequential manner; (3) the G-classifier leverages privileged knowledge of snippet-wise co-occurring action pairs to further improve co-occurring action detection. We evaluate CTRN on three challenging densely labelled datasets and achieve state-of-the-art performance, demonstrating the effectiveness and robustness of our method.
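To make the Representation Transform idea concrete, the sketch below shows one plausible reading of it: a per-class linear filter turns mixed snippet features into class-specific features, so that each (snippet, class) pair becomes a node in graph-structured data. All shapes, the random inputs, and the per-class filter bank `W` are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 8, 16      # number of snippets, mixed feature dim (assumed)
C, Dc = 4, 6      # number of action classes, class-specific dim (assumed)

X = rng.standard_normal((T, D))        # mixed snippet representations
W = rng.standard_normal((C, D, Dc))    # one learned filter per class (assumed)

# Filter class-specific features: result has shape (T, C, Dc)
class_feats = np.einsum('td,cde->tce', X, W)

# Flatten into graph nodes, one node per (snippet, class) pair,
# ready for class/temporal relation modelling over the node set
node_feats = class_feats.reshape(T * C, Dc)
print(node_feats.shape)  # (32, 6)
```

A relation module (such as the Class-Temporal Module) would then operate over these `T * C` nodes, connecting them along the class axis and the temporal axis.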