Temporal language grounding (TLG) is a fundamental and challenging problem for vision and language understanding. Existing methods mainly focus on the fully supervised setting with temporal boundary labels for training, which, however, suffers from expensive annotation costs. In this work, we are dedicated to weakly supervised TLG, where multiple description sentences are given for an untrimmed video without temporal boundary labels. In this task, it is critical to learn a strong cross-modal semantic alignment between sentence semantics and visual content. To this end, we introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding. Specifically, WSTAN learns cross-modal semantic alignment by exploiting a temporal adjacent network in a multiple instance learning (MIL) paradigm, with a whole description paragraph as input. Moreover, we integrate a complementary branch into the framework, which explicitly refines the predictions with pseudo supervision from the MIL stage. An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination through self-supervision. Extensive experiments are conducted on three widely used benchmark datasets, \emph{i.e.}, ActivityNet-Captions, Charades-STA, and DiDeMo, and the results demonstrate the effectiveness of our approach.
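To make the weakly supervised MIL paradigm concrete, the following is a minimal, hypothetical PyTorch sketch of cross-modal alignment over temporal proposals, not the WSTAN implementation itself: candidate moments are scored against a sentence embedding, and a video-level ranking loss is driven only by the best-scoring proposal, since no boundary labels are available. All module names, feature dimensions, and the margin loss are illustrative assumptions.

```python
# Hypothetical sketch (not the authors' code): MIL-style cross-modal
# alignment between temporal proposals and a query sentence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MILAlignmentSketch(nn.Module):
    def __init__(self, video_dim=500, text_dim=300, hidden_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)

    def forward(self, proposal_feats, sentence_feat):
        # proposal_feats: (num_proposals, video_dim) pooled moment features
        # sentence_feat:  (text_dim,) pooled sentence embedding
        v = F.normalize(self.video_proj(proposal_feats), dim=-1)
        t = F.normalize(self.text_proj(sentence_feat), dim=-1)
        return v @ t  # cosine-style alignment score per proposal

def mil_ranking_loss(scores_pos, scores_neg, margin=0.2):
    # Video-level MIL objective: the best proposal of the matching video
    # should outscore the best proposal of a non-matching video.
    return F.relu(margin - scores_pos.max() + scores_neg.max())

# Illustrative usage with random features
model = MILAlignmentSketch()
pos_scores = model(torch.randn(16, 500), torch.randn(300))
neg_scores = model(torch.randn(16, 500), torch.randn(300))
loss = mil_ranking_loss(pos_scores, neg_scores)
```

In the full method, predictions from this MIL stage would further serve as pseudo supervision for the complementary refinement branch described above; that refinement step is omitted here for brevity.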