In weakly-supervised temporal action localization (WS-TAL), methods commonly follow the "localization by classification" procedure, which aggregates snippet predictions into video-level class scores and then optimizes a video classification loss. In this procedure, the snippet predictions (or snippet attention weights) are used to separate foreground from background. However, these snippet predictions are usually inaccurate due to the absence of frame-wise labels, which hinders the overall performance. In this paper, we propose a novel method, C$^3$BN, to achieve robust snippet predictions. C$^3$BN comprises two key designs that exploit the inherent characteristics of video data. First, motivated by the natural continuity of adjacent snippets, we propose a micro data augmentation strategy that increases the diversity of snippets via convex combinations of adjacent snippets. Second, we propose a macro-micro consistency regularization strategy that forces the model to be invariant (or equivariant) to these snippet transformations with respect to video semantics, snippet predictions, and snippet features. Experimental results demonstrate the effectiveness of our method on top of baselines for WS-TAL tasks with both video-level and point-level supervision.
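To make the micro data augmentation concrete, the following is a minimal sketch of forming new snippets as convex combinations of temporally adjacent snippets. The function name, the Beta mixing prior, and the feature shapes are illustrative assumptions, not the authors' implementation.

```python
import torch

def mix_adjacent_snippets(features: torch.Tensor, alpha: float = 1.0):
    """Convex-combine each snippet with its temporal neighbor.

    features: (T, D) snippet features of one video.
    Returns mixed features (T-1, D) and the mixing weights (T-1,).
    """
    # One mixing coefficient per adjacent pair, drawn from Beta(alpha, alpha).
    lam = torch.distributions.Beta(alpha, alpha).sample((features.size(0) - 1,))
    lam = lam.unsqueeze(1)  # (T-1, 1) for broadcasting over feature dims
    mixed = lam * features[:-1] + (1.0 - lam) * features[1:]
    return mixed, lam.squeeze(1)

# Illustrative usage: the consistency regularization could then require the
# model's predictions on `mixed` to match the same convex combination of its
# predictions on the original adjacent snippets (an assumed form of the
# equivariance constraint mentioned in the abstract).
T, D = 128, 2048
feats = torch.randn(T, D)
mixed_feats, lam = mix_adjacent_snippets(feats)
```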