Traditional temporal action detection (TAD) usually handles untrimmed videos containing a small number of action instances from a single label (e.g., ActivityNet, THUMOS). However, this setting can be unrealistic, as different classes of actions often co-occur in practice. In this paper, we focus on the task of multi-label temporal action detection, which aims to localize all action instances in a multi-label untrimmed video. Multi-label TAD is more challenging, as it requires fine-grained class discrimination within a single video and precise localization of co-occurring instances. To address these challenges, we extend the sparse query-based detection paradigm from traditional TAD and propose a multi-label TAD framework termed PointTAD. Specifically, PointTAD introduces a small set of learnable query points to represent the important frames of each action instance. This point-based representation provides a flexible mechanism for localizing both the discriminative frames at action boundaries and the important frames inside the action. Moreover, we perform the action decoding process with a Multi-level Interactive Module to capture both point-level and instance-level action semantics. Finally, PointTAD employs an end-to-end trainable framework based solely on RGB input for easy deployment. We evaluate the proposed method on two popular benchmarks and introduce the new metric of detection-mAP for multi-label TAD. Our model outperforms all previous methods by a large margin under the detection-mAP metric and also achieves promising results under the segmentation-mAP metric. Code is available at https://github.com/MCG-NJU/PointTAD.
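To make the point-based representation concrete, the following is a minimal PyTorch sketch, not the authors' implementation, of the core idea: each action query owns a small set of learnable temporal points that sample frame features by linear interpolation; point-level features are pooled into an instance-level embedding for classification, and the segment is decoded from the extreme points so that interior points are free to cover discriminative frames inside the action. All names (`sample_point_features`, `QueryPointDecoder`, `num_points`) and the mean-pooling/min-max decoding choices are illustrative assumptions.

```python
# Illustrative sketch only (not PointTAD itself): learnable query points
# representing one action instance per query.
import torch
import torch.nn as nn


def sample_point_features(features: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """Gather frame features at fractional temporal positions via linear interpolation.

    features: (T, C) frame-level features of an untrimmed video.
    points:   (Q, P) normalized positions in [0, 1] for Q queries, P points each.
    returns:  (Q, P, C) per-point features.
    """
    T, _ = features.shape
    pos = points.clamp(0, 1) * (T - 1)           # map [0, 1] -> [0, T - 1]
    lo = pos.floor().long().clamp(0, T - 1)      # left neighbor frame index
    hi = (lo + 1).clamp(0, T - 1)                # right neighbor frame index
    w = (pos - lo.float()).unsqueeze(-1)         # interpolation weight, (Q, P, 1)
    return (1 - w) * features[lo] + w * features[hi]


class QueryPointDecoder(nn.Module):
    """Toy decoder: pools point-level features into an instance-level embedding
    for classification and decodes the segment from the extreme points."""

    def __init__(self, num_queries=8, num_points=4, dim=256, num_classes=20):
        super().__init__()
        self.points = nn.Parameter(torch.rand(num_queries, num_points))  # in [0, 1]
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, features):                 # features: (T, C)
        point_feats = sample_point_features(features, self.points)  # (Q, P, C)
        inst_feats = point_feats.mean(dim=1)     # naive instance-level pooling
        logits = self.classifier(inst_feats)     # (Q, num_classes)
        segments = torch.stack([self.points.min(-1).values,
                                self.points.max(-1).values], dim=-1)  # (Q, 2)
        return segments, logits


feats = torch.randn(128, 256)                    # 128 frames, 256-dim features
segments, logits = QueryPointDecoder()(feats)
print(segments.shape, logits.shape)              # (8, 2) and (8, 20)
```

In the actual model, the points are iteratively refined across decoder layers and the point/instance interaction is handled by the Multi-level Interactive Module; the sketch above only conveys why free-floating points are more flexible than a fixed (start, end) pair.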
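As a brief, hedged illustration of what the detection-mAP protocol measures: predicted segments are matched to ground-truth instances by temporal IoU, in contrast to the frame-level scoring behind segmentation-mAP. The helper name and the 0.5 threshold below are assumptions for the example.

```python
# Hypothetical helper: temporal IoU between two (start, end) segments in seconds.
def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a true positive when tIoU meets the threshold:
print(temporal_iou((1.0, 4.0), (2.0, 5.0)))  # 0.5 -> matched at threshold 0.5
```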