Research in action detection has grown in recent years, as the task plays a key role in video understanding. Modelling the interactions (either spatial or temporal) between actors and their context has proven essential for this task. While recent works use spatial features with aggregated temporal information, this work proposes to use non-aggregated temporal information. This is done by adding an attention-based method that leverages spatio-temporal interactions between elements in the scene along the clip. The main contribution of this work is the introduction of two cross-attention blocks that model spatial relations and capture short-range temporal interactions. Experiments on the AVA dataset show the advantages of the proposed approach, which models spatio-temporal relations between relevant elements in the scene and outperforms other methods that model actor interactions with their context by +0.31 mAP.
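To make the idea of cross attention over non-aggregated temporal information concrete, the following is a minimal sketch and not the architecture of this work. It assumes plain scaled dot-product attention without learned projections, and the names `actors`, `frames`, `per_frame`, and `temporal`, as well as the toy feature values, are hypothetical. A first pass lets each actor attend to context elements within each frame separately, and a second pass lets each actor attend over its own per-frame results, so frame-level features are never pooled before attention is applied.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross attention (no learned projections).

    queries: list of d-dim vectors (here: actor features)
    keys, values: equally long lists of vectors (here: context features)
    Returns one attended vector per query.
    """
    d = len(queries[0])
    dv = len(values[0])
    out = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dv)]
        )
    return out


# Hypothetical toy data: 2 actors, 3 frames, 2 context elements per frame.
actors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
frames = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]],
]

# Spatial pass (assumed): each actor attends to the context of one frame at a
# time, so the result keeps a separate feature per frame: [T][N][d].
per_frame = [cross_attention(actors, f, f) for f in frames]

# Temporal pass (assumed): each actor attends over its own per-frame features.
temporal = []
for i, actor in enumerate(actors):
    sequence = [per_frame[t][i] for t in range(len(frames))]
    temporal.append(cross_attention([actor], sequence, sequence)[0])
```

A real model would add learned query, key, and value projections, multiple heads, and positional or temporal encodings; those are omitted here to keep the two-stage attention pattern visible.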