Visual grounding is a long-standing problem in vision-language understanding owing to its diversity and complexity. Current practice concentrates mostly on visual grounding in still images or well-trimmed video clips. This work, in contrast, investigates a more general setting, generic visual grounding, which aims to mine all the objects that satisfy a given expression and is more challenging yet more practical in real-world scenarios. Importantly, grounding results are expected to localize targets accurately in both space and time. However, it is difficult to trade off appearance and motion features, and in realistic scenarios models tend to fail to distinguish distractors with similar attributes. Motivated by these considerations, we propose a simple yet effective approach, named DSTG, which 1) decomposes the spatial and temporal representations to collect all-sided cues for precise grounding, and 2) enhances discriminability against distractors and temporal consistency with a contrastive learning routing strategy. We further build a new video dataset, GVG, consisting of challenging referring cases over a wide range of videos. Extensive experiments demonstrate the superiority of DSTG over state-of-the-art methods on the Charades-STA, ActivityNet-Caption and GVG datasets. Code and dataset will be made available.
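To make the contrastive idea mentioned above concrete, the following is a minimal illustrative sketch, not the authors' DSTG implementation: an InfoNCE-style loss that pulls a query embedding toward the feature of its matching spatio-temporal tube and pushes it away from distractor tubes with similar attributes. All function names, tensor shapes, and the temperature value are assumptions introduced for illustration only.

```python
# Hypothetical sketch (not the released DSTG code): contrastive loss that
# separates the ground-truth tube from distractor tubes given a query.
import torch
import torch.nn.functional as F

def contrastive_grounding_loss(query, target, distractors, temperature=0.07):
    """
    query:       (D,)    query/sentence embedding           (assumed shape)
    target:      (D,)    feature of the ground-truth tube   (assumed shape)
    distractors: (N, D)  features of distractor tubes       (assumed shape)
    """
    # Candidate set: ground-truth tube at index 0, distractors after it.
    candidates = torch.cat([target.unsqueeze(0), distractors], dim=0)   # (N+1, D)
    sims = F.cosine_similarity(query.unsqueeze(0), candidates, dim=-1)  # (N+1,)
    logits = sims / temperature
    # Cross-entropy with label 0 maximizes similarity to the true tube
    # relative to the distractors.
    label = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), label)

# Usage with random features for illustration:
q = torch.randn(256)
t = torch.randn(256)
d = torch.randn(8, 256)
loss = contrastive_grounding_loss(q, t, d)
```

Such a loss is one standard way to encourage discriminability against look-alike distractors; how DSTG routes spatial versus temporal cues into the contrastive objective is specified in the paper itself, not in this sketch.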