Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances using only category labels. Most methods adopt off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization. However, because classification and localization have different optimization objectives, the resulting temporal localizations are seriously incomplete. To tackle this issue without additional annotations, this paper distills free action knowledge from Vision-Language Pre-training (VLP), motivated by the surprising observation that the localization results of vanilla VLP are over-complete, which is exactly complementary to the CBP results. To fuse this complementarity, we propose a novel distillation-collaboration framework with two branches acting as CBP and VLP, respectively. The framework is optimized through a dual-branch alternate training strategy. Specifically, during the B step, we distill confident background pseudo-labels from the CBP branch, while during the F step, confident foreground pseudo-labels are distilled from the VLP branch. As a result, the dual-branch complementarity is effectively fused into a strong alliance. Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 show that our method significantly outperforms state-of-the-art methods.
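The alternation between the B step and the F step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed thresholds `low` and `high`, and the per-snippet foreground scores are all hypothetical, chosen only to show how confident pseudo-labels might be selected from each branch while uncertain snippets are left unlabeled.

```python
from typing import List, Optional

# Hypothetical thresholds; the abstract does not say how confidence is decided.
def confident_background(cbp_scores: List[float], low: float = 0.2) -> List[Optional[int]]:
    """B step: keep snippets the CBP branch confidently scores as background (label 0).

    Snippets above `low` get None, meaning no pseudo-label is distilled for them.
    """
    return [0 if s <= low else None for s in cbp_scores]


def confident_foreground(vlp_scores: List[float], high: float = 0.8) -> List[Optional[int]]:
    """F step: keep snippets the VLP branch confidently scores as foreground (label 1).

    Snippets below `high` get None, meaning no pseudo-label is distilled for them.
    """
    return [1 if s >= high else None for s in vlp_scores]


def alternate_steps(num_rounds: int) -> List[str]:
    """Return the order of steps for a dual-branch alternate schedule (assumed B then F)."""
    return ["B", "F"] * num_rounds


# Toy per-snippet foreground scores for one video (made-up numbers).
cbp_scores = [0.05, 0.10, 0.90, 0.85, 0.40, 0.08]  # CBP tends to be incomplete
vlp_scores = [0.30, 0.70, 0.95, 0.90, 0.85, 0.20]  # VLP tends to be over-complete

bg_labels = confident_background(cbp_scores)  # [0, 0, None, None, None, 0]
fg_labels = confident_foreground(vlp_scores)  # [None, None, 1, 1, 1, None]
schedule = alternate_steps(2)                 # ['B', 'F', 'B', 'F']
```

In this toy example the fifth snippet is scored only 0.40 by the CBP branch but 0.85 by the VLP branch, so the F step supplies a foreground pseudo-label where CBP alone would likely miss part of the action. Which branch consumes each set of pseudo-labels, and with what loss, is left out because the abstract does not specify it.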