This work proposes a weakly-supervised temporal action localization framework, called D2-Net, which strives to temporally localize actions using only video-level supervision. Our main contribution is a novel loss formulation that jointly enhances the discriminability of latent embeddings and the robustness of the output temporal class activations to the foreground-background noise caused by weak supervision. The proposed formulation comprises a discriminative and a denoising loss term. The discriminative term incorporates a classification loss and utilizes a top-down attention mechanism to enhance the separability of latent foreground-background embeddings. The denoising term explicitly addresses the foreground-background noise in class activations by simultaneously maximizing intra-video and inter-video mutual information using a bottom-up attention mechanism. As a result, activations in the foreground regions are emphasized whereas those in the background regions are suppressed, leading to more robust predictions. Comprehensive experiments are performed on multiple benchmarks, including THUMOS14 and ActivityNet1.2. Our D2-Net performs favorably compared to existing methods on all datasets, achieving gains as high as 2.3% in terms of mAP at IoU=0.5 on THUMOS14. Source code is available at https://github.com/naraysa/D2-Net.
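The reported metric, mAP at IoU=0.5, depends on the temporal intersection-over-union between a predicted action segment and a ground-truth segment. The following is a minimal sketch of that overlap test, not code from the D2-Net repository: the function names `temporal_iou` and `is_true_positive` and the `(start, end)` tuple representation are assumptions chosen for illustration.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments, each given as (start, end) in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Length of the overlap; zero when the segments are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    # Union = sum of both lengths minus the part counted twice.
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """Hypothetical helper: a prediction matches if its IoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold
```

For example, a prediction spanning 0 to 10 seconds against a ground-truth action spanning 2 to 10 seconds overlaps for 8 seconds out of a 10-second union, so it would count as a match at the 0.5 threshold. A full mAP computation additionally ranks predictions by confidence and averages precision over classes, which this sketch omits.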