Temporal action proposal generation, which aims to localize action instances by their start and end frames in untrimmed videos, benefits substantially from proper exploitation of temporal and semantic context. Recent work models temporal context and similarity-based semantic context through self-attention modules, but these methods still suffer from cluttered background information and limited contextual feature learning. In this paper, we propose a novel Pyramid Region-based Slot Attention (PRSlot) module to address these issues. Instead of relying on similarity computation, our PRSlot module directly learns local relations in an encoder-decoder manner and generates an enhanced representation of a local region, called a \textit{slot}, based on attention over the input features. Specifically, given the input snippet-level features, the PRSlot module takes the target snippet as the \textit{query} and its surrounding region as the \textit{key}, and then generates slot representations for each \textit{query-key} slot by aggregating the local snippet context with a parallel pyramid strategy. Based on PRSlot modules, we present a novel Pyramid Region-based Slot Attention Network, termed PRSA-Net, which learns a unified visual representation with rich temporal and semantic context for better proposal generation. Extensive experiments are conducted on two widely adopted benchmarks, THUMOS14 and ActivityNet-1.3. Our PRSA-Net outperforms other state-of-the-art methods. In particular, on THUMOS14 we improve AR@100 from the previous best 50.67\% to 56.12\% for proposal generation and raise the mAP at 0.5 tIoU from 51.9\% to 58.7\% for action detection. \textit{Code is available at} \url{https://github.com/handhand123/PRSA-Net}.
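To make the query/key/pyramid description above concrete, the following is a minimal illustrative sketch, not the PRSlot module itself. It assumes a simplified additive attention score with a hypothetical weight vector `v` standing in for the learned encoder-decoder, a hypothetical set of region radii for the pyramid levels, and simple averaging plus a residual connection to fuse the levels. None of these choices are stated in the abstract; the released code is the authority on the actual design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_slot(features, t, radius, v):
    """Aggregate the region around snippet t into one vector.

    The target snippet is the query and the snippets within `radius`
    of it are the keys. The score v . tanh(query + key) is a stand-in
    (assumption) for the learned relation described in the abstract.
    """
    lo = max(0, t - radius)
    hi = min(len(features), t + radius + 1)
    region = features[lo:hi]
    query = features[t]
    scores = [
        sum(vd * math.tanh(qd + kd) for vd, qd, kd in zip(v, query, key))
        for key in region
    ]
    weights = softmax(scores)
    dim = len(query)
    # Attention-weighted sum of the region's snippet features.
    return [sum(w * key[d] for w, key in zip(weights, region)) for d in range(dim)]


def pyramid_region_context(features, v, radii=(1, 2, 4)):
    """Enhance each snippet with context from several region sizes.

    Each radius is one pyramid level, computed independently (hence
    parallelizable). Levels are averaged and added to the input
    snippet; both choices are assumptions made for illustration.
    """
    dim = len(features[0])
    enhanced = []
    for t in range(len(features)):
        slots = [region_slot(features, t, r, v) for r in radii]
        fused = [sum(s[d] for s in slots) / len(slots) for d in range(dim)]
        enhanced.append([features[t][d] + fused[d] for d in range(dim)])
    return enhanced


if __name__ == "__main__":
    # Three snippets with 2-dimensional toy features.
    snippets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(pyramid_region_context(snippets, v=[0.5, -0.5], radii=(1, 2)))
```

With `v` set to all zeros, every key in a region receives the same weight, so each level reduces to mean pooling over the region. This is a convenient sanity check before substituting learned parameters.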