Spatio-Temporal Video Grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object described by a free-form textual expression. Existing approaches mainly treat this complicated task as a parallel frame-grounding problem and thus suffer from two types of inconsistency drawbacks: feature alignment inconsistency and prediction inconsistency. In this paper, we present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT), to alleviate these issues. Specifically, we introduce a novel multi-modal template as the global objective for this task, which explicitly constrains the grounding region and associates the predictions across all video frames. Moreover, to generate this template with sufficient video-textual perception, we propose an encoder-decoder architecture for effective global context modeling. Thanks to these critical designs, STCAT enjoys more consistent cross-modal feature alignment and tube prediction without relying on any pre-trained object detectors. Extensive experiments show that our method outperforms previous state-of-the-art methods by clear margins on two challenging video benchmarks (VidSTG and HC-STVG), illustrating the superiority of the proposed framework in better understanding the association between vision and natural language. Code is publicly available at \url{https://github.com/jy0205/STCAT}.
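To make the high-level idea concrete, the following is a minimal, illustrative sketch of a cross-modal encoder-decoder that decodes a single shared query ("template") against a jointly encoded video-text context and broadcasts it to all frames. The module names, dimensions, and the way the template is applied here are assumptions for illustration only, not the authors' STCAT implementation.

```python
# Illustrative sketch only: a generic cross-modal encoder-decoder with a shared
# "template" query, loosely mirroring the abstract's high-level description.
# All names, dimensions, and heads are hypothetical, not the STCAT code.
import torch
import torch.nn as nn

class CrossModalGrounder(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        # One learned query shared across all frames, so per-frame predictions
        # are derived from the same global representation (one source of consistency).
        self.template = nn.Parameter(torch.randn(1, 1, d_model))
        self.box_head = nn.Linear(d_model, 4)    # per-frame box (cx, cy, w, h)
        self.time_head = nn.Linear(d_model, 2)   # start/end logits for the temporal tube

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, d) frame features; text_feats: (B, L, d) token features
        B, T, _ = video_feats.shape
        # Joint encoding of both modalities provides the global video-text context.
        memory = self.encoder(torch.cat([video_feats, text_feats], dim=1))
        # The shared template is decoded once against the global context ...
        template = self.decoder(self.template.expand(B, -1, -1), memory)  # (B, 1, d)
        # ... then broadcast to every frame, tying the per-frame predictions together.
        per_frame = template.expand(B, T, -1) + video_feats
        boxes = self.box_head(per_frame).sigmoid()     # (B, T, 4)
        span = self.time_head(template.squeeze(1))     # (B, 2)
        return boxes, span

# Toy usage with random tensors standing in for video/text backbone features.
model = CrossModalGrounder()
boxes, span = model(torch.randn(2, 16, 256), torch.randn(2, 10, 256))
print(boxes.shape, span.shape)  # torch.Size([2, 16, 4]) torch.Size([2, 2])
```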