Localizing persons and recognizing their actions from videos is a challenging task toward high-level video understanding. Recent advances have been achieved by modeling direct pairwise relations between entities. In this paper, we take one step further, not only modeling direct relations between pairs of entities but also taking into account indirect higher-order relations established upon multiple elements. We propose to explicitly model the Actor-Context-Actor Relation, i.e., the relation between two actors based on their interactions with the context. To this end, we design an Actor-Context-Actor Relation Network (ACAR-Net), which builds upon a novel High-order Relation Reasoning Operator and an Actor-Context Feature Bank to enable indirect relation reasoning for spatio-temporal action localization. Experiments on the AVA and UCF101-24 datasets show the advantages of modeling actor-context-actor relations, and visualization of attention maps further verifies that our model is capable of finding relevant higher-order relations to support action detection. Notably, our method ranks first in the AVA-Kinetics action localization task of ActivityNet Challenge 2020, outperforming other entries by a significant margin (+6.71 mAP). Training code and models will be available at https://github.com/Siyu-C/ACAR-Net.
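To make the idea of actor-context-actor relation reasoning concrete, the sketch below illustrates one plausible reading of the abstract in PyTorch: each actor's pooled feature is first fused with every spatial location of a context feature map (first-order actor-context relations), and the resulting per-actor maps then attend to one another so that an actor can exploit other actors' interactions with the same context (second-order relations). This is a minimal sketch under assumed shapes and layer choices, not the authors' implementation; the class name and parameters are hypothetical.

```python
# Minimal illustrative sketch of actor-context-actor relation reasoning.
# Assumptions: pooled actor RoI features of size (N, C) and a spatio-temporally
# pooled context feature map of size (C, H, W). Not the official ACAR-Net code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActorContextActorSketch(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # First-order: fuse each actor's feature with every context location.
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)
        # Second-order: let actor-context features attend to each other.
        self.q = nn.Conv2d(dim, dim, kernel_size=1)
        self.k = nn.Conv2d(dim, dim, kernel_size=1)
        self.v = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, actor_feats, context):
        # actor_feats: (N, C) pooled RoI features for N detected actors
        # context:     (C, H, W) context feature map from the video backbone
        N, C = actor_feats.shape
        _, H, W = context.shape
        ctx = context.unsqueeze(0).expand(N, C, H, W)
        act = actor_feats.view(N, C, 1, 1).expand(N, C, H, W)
        # First-order actor-context relation features, one map per actor.
        ac = F.relu(self.fuse(torch.cat([act, ctx], dim=1)))    # (N, C, H, W)
        # Second-order reasoning: attention across actors at each location,
        # so actor i can use actor j's interaction with the same context.
        q, k, v = self.q(ac), self.k(ac), self.v(ac)            # (N, C, H, W)
        q = q.permute(2, 3, 0, 1)                               # (H, W, N, C)
        k = k.permute(2, 3, 1, 0)                               # (H, W, C, N)
        v = v.permute(2, 3, 0, 1)                               # (H, W, N, C)
        attn = torch.softmax(q @ k / C ** 0.5, dim=-1)          # (H, W, N, N)
        out = (attn @ v).permute(2, 3, 0, 1)                    # (N, C, H, W)
        # Pool over space: one higher-order relation feature per actor.
        return out.mean(dim=(2, 3))                             # (N, C)
```

Stacking such higher-order relation features over time would play a role analogous to the paper's Actor-Context Feature Bank, supplying long-range support features for the final per-actor action classifier.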