Temporal grounding aims to localize the target moment in an untrimmed video that semantically corresponds to a given sentence query. However, recent works have found that existing methods suffer from a severe temporal bias problem: rather than reasoning about target moment locations based on visual-textual semantic alignment, they over-rely on the temporal biases of queries in the training set. To this end, this paper proposes a novel training framework for grounding models that uses shuffled videos to address the temporal bias problem without losing grounding accuracy. Our framework introduces two auxiliary tasks, cross-modal matching and temporal order discrimination, to promote grounding model training. The cross-modal matching task leverages the content consistency between shuffled and original videos to force the grounding model to mine visual content that semantically matches the query. The temporal order discrimination task leverages the difference in temporal order to strengthen the understanding of long-term temporal contexts. Extensive experiments on Charades-STA and ActivityNet Captions demonstrate the effectiveness of our method in mitigating the reliance on temporal biases and strengthening the model's generalization ability against different temporal distributions. Code is available at https://github.com/haojc/ShufflingVideosForTSG.
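To make the described training framework concrete, below is a minimal PyTorch-style sketch of the overall objective: a standard grounding loss on the original video plus the two auxiliary losses computed with a clip-shuffled copy of the same video. All names and interfaces here (`ShuffledVideoObjective`, `match_head`, `order_head`, the assumed `grounding_model` returning a moment prediction and fused query-video features) are illustrative assumptions for exposition, not the authors' released implementation; see the repository linked above for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the shuffled-video training objective described in the abstract.
# Assumes `grounding_model(video_feats, query_feats)` returns
# (moment prediction of shape (B, 2), fused features of shape (B, T, D)).
class ShuffledVideoObjective(nn.Module):
    def __init__(self, grounding_model: nn.Module, feat_dim: int):
        super().__init__()
        self.grounding_model = grounding_model
        self.match_head = nn.Linear(feat_dim, 1)   # query-video matching score
        self.order_head = nn.Linear(feat_dim, 2)   # original vs. shuffled order

    def forward(self, video_feats, query_feats, gt_moments):
        # video_feats: (B, T, D) clip features, query_feats: (B, L, D),
        # gt_moments: (B, 2) normalized start/end times.
        pred, fused = self.grounding_model(video_feats, query_feats)
        loss_ground = F.smooth_l1_loss(pred, gt_moments)

        # Shuffle the clip order: content is preserved, temporal order is not.
        perm = torch.randperm(video_feats.size(1), device=video_feats.device)
        _, fused_shuf = self.grounding_model(video_feats[:, perm], query_feats)

        pooled = fused.mean(dim=1)            # (B, D)
        pooled_shuf = fused_shuf.mean(dim=1)  # (B, D)

        # Cross-modal matching: both the original and the shuffled video still
        # contain the queried content, so both should score as a match; videos
        # paired with another sample's query (batch roll) serve as negatives.
        pos = self.match_head(torch.cat([pooled, pooled_shuf]))            # (2B, 1)
        _, fused_neg = self.grounding_model(video_feats, query_feats.roll(1, dims=0))
        neg = self.match_head(fused_neg.mean(dim=1))                       # (B, 1)
        logits = torch.cat([pos, neg]).squeeze(-1)
        labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)]).squeeze(-1)
        loss_match = F.binary_cross_entropy_with_logits(logits, labels)

        # Temporal order discrimination: classify whether the clips are in their
        # original order, which requires modelling long-term temporal context.
        order_logits = torch.cat([self.order_head(pooled), self.order_head(pooled_shuf)])
        order_labels = torch.cat([
            torch.zeros(pooled.size(0), dtype=torch.long, device=pooled.device),
            torch.ones(pooled_shuf.size(0), dtype=torch.long, device=pooled.device),
        ])
        loss_order = F.cross_entropy(order_logits, order_labels)

        return loss_ground + loss_match + loss_order
```

In this reading, the matching loss rewards the model for finding query-relevant visual content regardless of clip order, while the order-discrimination loss can only be solved by attending to long-range temporal structure; loss weighting and the exact heads are design choices not specified in the abstract.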