Weakly supervised temporal action localization aims to generate temporal boundaries for actions of interest while also classifying their categories. Pseudo-label-based methods are an effective solution and have been widely studied recently. However, existing methods generate pseudo labels during training and make predictions during testing under different pipelines or settings, resulting in a gap between training and testing. In this paper, we propose to generate high-quality pseudo labels from the predicted action boundaries. However, existing post-processing steps such as non-maximum suppression (NMS) discard information, which makes them insufficient for generating high-quality action boundaries. More importantly, transforming action boundaries into pseudo labels is challenging, since the predicted action instances generally overlap and carry different confidence scores. Moreover, the generated pseudo labels can fluctuate and be inaccurate in the early stages of training; without a self-correction mechanism, training may repeatedly reinforce false predictions. To tackle these issues, we propose an effective pipeline for learning better pseudo labels. First, we propose a Gaussian weighted fusion module to preserve the information of action instances and obtain high-quality action boundaries. Second, we formulate pseudo-label generation as an optimization problem constrained by the confidence scores of action instances. Finally, we introduce $\Delta$ pseudo labels, which give the model the ability to self-correct. Our method outperforms existing methods on two benchmarks, THUMOS14 and ActivityNet1.3, with gains of 1.9\% and 3.7\% in average mAP, respectively.
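To illustrate the contrast the abstract draws between NMS and Gaussian weighted fusion, the following is a minimal sketch and not the paper's actual module: the abstract does not give the fusion formula. It assumes overlapping proposals are clustered around the highest-scoring one and averaged with weights of the form score × exp(−(1 − IoU)² / σ). The function names, the `iou_thresh` and `sigma` parameters, and the weighting formula itself are all hypothetical choices made for illustration.

```python
import numpy as np


def temporal_iou(seg, segs):
    """IoU between one [start, end] segment and an array of segments."""
    inter = np.clip(
        np.minimum(seg[1], segs[:, 1]) - np.maximum(seg[0], segs[:, 0]), 0.0, None
    )
    union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
    return inter / np.maximum(union, 1e-8)


def gaussian_weighted_fusion(segments, scores, iou_thresh=0.5, sigma=0.5):
    """Fuse overlapping proposals instead of discarding them as NMS would.

    Assumed (hypothetical) weighting: score * exp(-(1 - IoU)^2 / sigma),
    where IoU is measured against the top-scoring proposal of the cluster.
    """
    segments = np.asarray(segments, dtype=float)
    scores = np.asarray(scores, dtype=float)
    remaining = np.argsort(-scores)  # indices sorted by descending score
    fused_segments, fused_scores = [], []

    while remaining.size > 0:
        top = remaining[0]
        ious = temporal_iou(segments[top], segments[remaining])
        in_cluster = ious >= iou_thresh
        in_cluster[0] = True  # the top proposal always belongs to its own cluster
        members = remaining[in_cluster]

        # Proposals that agree more with the top one contribute more.
        weights = scores[members] * np.exp(-((1.0 - ious[in_cluster]) ** 2) / sigma)
        fused_segments.append(weights @ segments[members] / weights.sum())
        fused_scores.append(scores[top])  # keep the cluster's best score

        remaining = remaining[~in_cluster]

    return np.array(fused_segments), np.array(fused_scores)
```

Under these assumptions, a lower-scoring proposal that overlaps the top one shifts the fused boundaries slightly toward itself, whereas plain NMS would drop it and keep the top proposal's boundaries unchanged.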