We propose a new formulation of temporal action detection (TAD) with denoising diffusion, termed DiffTAD in short. Taking random temporal proposals as input, it can accurately yield action proposals from an untrimmed long video. This presents a generative modeling perspective, in contrast to previous discriminative learning approaches. The capability is achieved by first diffusing the ground-truth proposals into random ones (i.e., the forward/noising process) and then learning to reverse this noising process (i.e., the backward/denoising process). Concretely, we establish the denoising process within a Transformer decoder (e.g., DETR), introducing a temporal location query design that yields faster training convergence. We further propose a cross-step selective conditioning algorithm for inference acceleration. Extensive evaluations on ActivityNet and THUMOS show that DiffTAD achieves top performance over previous state-of-the-art alternatives. The code will be made available at https://github.com/sauradip/DiffusionTAD.
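To make the forward/noising step concrete, the following is a minimal sketch (not the authors' released code) of diffusing ground-truth temporal proposals into random ones using the standard DDPM q(x_t | x_0) formulation; the (center, width) proposal encoding normalized to [0, 1] and the cosine noise schedule are assumptions for illustration.

```python
# Sketch of the forward/noising process on temporal proposals (assumed encoding).
import torch

def cosine_alpha_bar(T: int, s: float = 0.008) -> torch.Tensor:
    """Cumulative noise schedule \\bar{alpha}_t (cosine schedule, an assumption here)."""
    steps = torch.arange(T + 1, dtype=torch.float32)
    f = torch.cos((steps / T + s) / (1 + s) * torch.pi / 2) ** 2
    return (f[1:] / f[0]).clamp(1e-5, 1.0)

def noise_proposals(gt_proposals: torch.Tensor, t: torch.Tensor,
                    alpha_bar: torch.Tensor) -> torch.Tensor:
    """Diffuse ground-truth (center, width) proposals in [0, 1] to noisy ones.

    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I)
    """
    x0 = gt_proposals * 2.0 - 1.0                 # map [0, 1] -> [-1, 1]
    abar_t = alpha_bar[t].view(-1, 1, 1)          # (B, 1, 1) for broadcasting
    eps = torch.randn_like(x0)
    xt = abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * eps
    return (xt.clamp(-1.0, 1.0) + 1.0) / 2.0      # back to [0, 1]

# Usage: a batch of 2 videos, each with 4 ground-truth (center, width) proposals.
T = 1000
alpha_bar = cosine_alpha_bar(T)
gt = torch.rand(2, 4, 2)                          # normalized temporal proposals
t = torch.randint(0, T, (2,))                     # one random diffusion step per video
noisy = noise_proposals(gt, t, alpha_bar)         # input to the denoising decoder
```

At inference, the model starts from purely random proposals and iteratively denoises them with the Transformer decoder; the cross-step selective conditioning mentioned above reduces the number of such denoising iterations.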