Temporal representation is the cornerstone of modern action detection techniques. State-of-the-art methods mostly rely on a dense anchoring scheme, in which anchors are sampled uniformly over the temporal domain on a discretized grid and accurate boundaries are then regressed. In this paper, we revisit this foundational stage and introduce Recurrent Continuous Localization (RCL), which learns a fully continuous anchoring representation. Specifically, the proposed representation builds upon an explicit model conditioned on video embeddings and temporal coordinates, which makes it possible to detect segments of arbitrary length. To optimize the continuous representation, we develop an effective scale-invariant sampling strategy and recurrently refine the prediction in subsequent iterations. Our continuous anchoring scheme is fully differentiable, so it can be seamlessly integrated into existing detectors, e.g., BMN and G-TAD. Extensive experiments on two benchmarks demonstrate that our continuous representation consistently surpasses its discretized counterparts by ~2% mAP. As a result, RCL achieves 52.92% mAP@0.5 on THUMOS14 and 37.65% mAP on ActivityNet v1.3, outperforming all existing single-model detectors.
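The contrast between a discretized anchor grid and continuous temporal coordinates can be illustrated with a minimal sketch. It is not the RCL implementation: the function names, the log-uniform length sampling, the linear feature interpolation, and the logistic scorer are all assumptions chosen only to show how a model conditioned on video embeddings and real-valued coordinates could score a segment of any length.

```python
import math
import random


def grid_anchors(num_bins):
    """Dense discretized anchors: every (start, end) pair on a uniform grid over [0, 1]."""
    step = 1.0 / num_bins
    return [(i * step, j * step)
            for i in range(num_bins)
            for j in range(i + 1, num_bins + 1)]


def sample_continuous_anchors(n, min_len=0.01, rng=None):
    """Draw n anchors with real-valued (start, end) coordinates in [0, 1].

    Assumption: lengths are drawn log-uniformly so that short and long
    segments are sampled comparably often. This only illustrates the idea
    of scale-aware sampling; it is not the paper's sampling strategy.
    """
    rng = rng or random.Random(0)
    anchors = []
    for _ in range(n):
        length = math.exp(rng.uniform(math.log(min_len), 0.0))
        start = rng.uniform(0.0, 1.0 - length)
        end = min(1.0, start + length)  # guard against float rounding
        anchors.append((start, end))
    return anchors


def interpolate(features, t):
    """Linearly interpolate a per-snippet feature sequence at time t in [0, 1]."""
    t = min(max(t, 0.0), 1.0)
    pos = t * (len(features) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(features) - 1)
    w = pos - lo
    return [(1 - w) * a + w * b for a, b in zip(features[lo], features[hi])]


def score_anchor(features, weights, start, end):
    """Hypothetical scorer: logistic function of the features sampled at the
    anchor boundaries, concatenated with the coordinates themselves."""
    x = interpolate(features, start) + interpolate(features, end) + [start, end]
    z = sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))
```

A grid with `num_bins` cells can only propose boundaries at multiples of `1 / num_bins`, whereas `score_anchor` accepts any real-valued `(start, end)` pair, so the set of candidate segments is not tied to a fixed resolution.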