With only video-level labels, weakly supervised temporal action localization (WTAL) adopts a localization-by-classification paradigm to detect and classify actions in untrimmed videos. Because the classifier is optimized for discriminability, class-specific background snippets are inevitably mis-activated. To alleviate this background disturbance, existing methods enlarge the discrepancy between action and background by modeling background snippets with pseudo snippet-level annotations, which largely rely on hand-crafted assumptions. In contrast to previous works, we present an adversarial learning strategy that removes the need to mine pseudo background snippets. Concretely, a background classification loss, driven by a background gradient reinforcement strategy, forces the model to regard the whole video as background, confusing the recognition model. Conversely, the foreground (action) loss guides the model to focus on action snippets under this pressure. The competition between the two classification losses thus drives the model to strengthen its action modeling. In addition, a novel temporal enhancement network is designed to help the model construct temporal relations among affinitive snippets under the proposed strategy, further improving localization performance. Finally, extensive experiments on THUMOS14 and ActivityNet1.2 demonstrate the effectiveness of the proposed method.
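The competition between the foreground and background classification losses can be illustrated with a minimal sketch. This is not the paper's implementation: the top-k multiple-instance pooling, the background-class index, and the function names are all assumptions introduced here for illustration only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def competing_losses(snippet_logits, action_label, k=3):
    """Hypothetical sketch of two competing video-level losses.

    snippet_logits: (T, C+1) per-snippet class scores; the last
                    index is assumed to be the background class.
    action_label:   ground-truth video-level action class index.
    """
    num_classes = snippet_logits.shape[1]
    bg_class = num_classes - 1  # assumption: background is the last class

    # Aggregate the top-k snippet scores per class into a video-level
    # score (a common multiple-instance-learning pooling choice).
    video_logits = np.sort(snippet_logits, axis=0)[-k:].mean(axis=0)
    video_probs = softmax(video_logits)

    # Foreground loss: classify the video as its action label.
    fg_loss = -np.log(video_probs[action_label])

    # Background loss: push the same video-level prediction toward
    # background, mimicking the "whole video regarded as background"
    # objective; minimizing both losses jointly creates the competition.
    bg_loss = -np.log(video_probs[bg_class])

    return fg_loss, bg_loss
```

Minimizing the sum of the two losses pits them against each other: the background term can only shrink at the expense of the foreground term, so the model is forced to sharpen which snippets actually carry the action evidence.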