Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines: they often train multiple networks and rely on hand-designed operations, such as non-maximum suppression and anchor generation, which limit flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances based on this context. To adapt the Transformer to TAD, we propose three improvements that enhance its locality awareness. The core is a temporal deformable attention module that selectively attends to a sparse set of key snippets in a video. A segment refinement mechanism and an actionness regression head are designed to refine the boundaries and confidence of the predicted instances, respectively. With such a simple pipeline, TadTR has a lower computational cost than previous detectors while preserving remarkable performance. As a self-contained detector, it achieves state-of-the-art performance on THUMOS14 (56.7% mAP) and HACS Segments (32.09% mAP). Combined with an extra action classifier, it obtains 36.75% mAP on ActivityNet-1.3. Code is available at https://github.com/xlliu7/TadTR.
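To make the core mechanism concrete, below is a minimal sketch of 1-D temporal deformable attention in PyTorch, assuming a Deformable-DETR-style design in which each action query predicts a sparse set of temporal sampling offsets and attention weights around a reference point. All names and shapes here (TemporalDeformableAttention, n_points, ref_points) are illustrative assumptions for exposition, not the authors' implementation.

```python
# A minimal sketch of 1-D (temporal) deformable attention, assuming a
# Deformable-DETR-style design adapted to the time axis. Illustrative only;
# not the TadTR reference code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDeformableAttention(nn.Module):
    def __init__(self, dim=256, n_heads=8, n_points=4):
        super().__init__()
        self.n_heads, self.n_points = n_heads, n_points
        self.head_dim = dim // n_heads
        # Each query predicts, per head, a few temporal offsets and weights.
        self.sampling_offsets = nn.Linear(dim, n_heads * n_points)
        self.attn_weights = nn.Linear(dim, n_heads * n_points)
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, value):
        # query: (B, Q, C) action queries; ref_points: (B, Q) in [0, 1]
        # value: (B, T, C) snippet features of the video
        B, Q, C = query.shape
        T = value.shape[1]
        v = self.value_proj(value).view(B, T, self.n_heads, self.head_dim)
        # Sparse sampling locations around each query's reference point.
        offsets = self.sampling_offsets(query).view(B, Q, self.n_heads, self.n_points)
        locs = (ref_points[..., None, None] + offsets / T).clamp(0, 1)  # (B, Q, H, P)
        weights = self.attn_weights(query).view(B, Q, self.n_heads, self.n_points)
        weights = weights.softmax(-1)
        # Linearly interpolate value features at the sampled locations.
        # grid_sample needs a 2-D grid; treat time as width, a dummy height of 1.
        v = v.permute(0, 2, 3, 1).reshape(B * self.n_heads, self.head_dim, 1, T)
        grid_x = locs.permute(0, 2, 1, 3).reshape(B * self.n_heads, Q, self.n_points) * 2 - 1
        grid = torch.stack([grid_x, torch.zeros_like(grid_x)], dim=-1)  # (B*H, Q, P, 2)
        sampled = F.grid_sample(v, grid, align_corners=True)            # (B*H, Dh, Q, P)
        w = weights.permute(0, 2, 1, 3).reshape(B * self.n_heads, 1, Q, self.n_points)
        out = (sampled * w).sum(-1)                                     # (B*H, Dh, Q)
        out = out.view(B, self.n_heads, self.head_dim, Q).permute(0, 3, 1, 2).reshape(B, Q, C)
        return self.out_proj(out)

if __name__ == "__main__":
    attn = TemporalDeformableAttention()
    q = torch.randn(2, 10, 256)       # 10 action queries
    refs = torch.rand(2, 10)          # normalized reference points in [0, 1]
    feats = torch.randn(2, 100, 256)  # 100 snippet features
    print(attn(q, refs, feats).shape) # torch.Size([2, 10, 256])
```

Under these assumptions, each query attends to only n_heads × n_points snippets rather than all T, which is consistent with the abstract's claim that sparse attention to key snippets lowers computation relative to dense attention.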