Action localization networks are often structured as a feature encoder sub-network and a localization sub-network, where the feature encoder learns to transform an input video into features that the localization sub-network can use to generate reliable action proposals. Although some of the encoded features may be more useful than others for generating action proposals, prior action localization approaches do not include any attention mechanism that enables the localization sub-network to attend more to the more important features. In this paper, we propose a novel attention mechanism, Class Semantics-based Attention (CSA), which learns from the temporal distribution of the semantics of action classes present in an input video to compute importance scores for the encoded features; these scores are then used to direct attention to the more useful encoded features. We demonstrate on two popular action detection datasets that incorporating our attention mechanism provides considerable performance gains on competitive action detection models (e.g., an improvement of around 6.2% over the BMN action detection baseline, reaching 47.5% mAP on the THUMOS-14 dataset) and yields a new state of the art of 36.25% mAP on the ActivityNet v1.3 dataset. Further, the CSA localization model family, which includes BMN-CSA, was part of the second-placed submission at the 2021 ActivityNet action localization challenge. Our attention mechanism outperforms prior self-attention modules such as squeeze-and-excitation on the action detection task. We also observe that our attention mechanism is complementary to such self-attention modules, in that performance improves when both are used together.
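To make the idea of importance scores concrete, the following is a minimal NumPy sketch of one way class semantics could gate encoded features. It is an illustration under stated assumptions, not the CSA architecture itself: the function name `class_semantics_attention`, the projection matrix `w_channel`, the use of per-snippet class probabilities as the "temporal distribution of class semantics", and the mean/max pooling choices are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def class_semantics_attention(features, class_probs, w_channel):
    """Gate encoded features with scores derived from class semantics.

    All names and shapes here are assumptions made for illustration:
      features    : (C, T) encoded features, C channels over T snippets
      class_probs : (K, T) per-snippet probabilities over K action classes
      w_channel   : (C, K) hypothetical learned projection from class
                    presence to per-channel importance
    """
    # Summarize how strongly each class is present across the whole video.
    class_presence = class_probs.mean(axis=1)               # (K,)

    # Per-channel importance score in (0, 1).
    channel_scores = sigmoid(w_channel @ class_presence)    # (C,)

    # Per-snippet importance: confidence of the most likely class.
    temporal_scores = class_probs.max(axis=0)               # (T,)

    # Rescale features so that more important channels and snippets dominate.
    attended = features * channel_scores[:, None] * temporal_scores[None, :]
    return attended, channel_scores, temporal_scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, K, T = 8, 5, 10
    features = rng.normal(size=(C, T))
    logits = rng.normal(size=(K, T))
    # Softmax over classes at every snippet.
    class_probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    w_channel = rng.normal(size=(C, K))

    attended, ch, tm = class_semantics_attention(features, class_probs, w_channel)
    print(attended.shape, ch.shape, tm.shape)
```

In this sketch the attended features keep the shape of the input, so they could be passed to a localization sub-network unchanged. In a trained model the projection would be learned jointly with the rest of the network rather than sampled at random.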