This paper studies the multimedia problem of temporal sentence grounding (TSG), which aims to localize the segment of an untrimmed video that corresponds to a given sentence query. Traditional TSG methods mainly follow a top-down or bottom-up framework and are not end-to-end; they rely heavily on time-consuming post-processing to refine the grounding results. Recently, several transformer-based approaches have been proposed to model the fine-grained semantic alignment between video and query efficiently and effectively. Although these methods achieve strong performance, they treat the frames of the video and the words of the query uniformly as transformer input for correlation, and so fail to capture their different levels of granularity, each of which carries distinct semantics. To address this issue, we propose a novel Hierarchical Local-Global Transformer (HLGT) that leverages this hierarchy information and models the interactions between different levels of granularity and between different modalities in order to learn more fine-grained multi-modal representations. Specifically, we first split the video and the query into individual clips and phrases, and learn their local context (adjacent dependency) and global correlation (long-range dependency) via a temporal transformer. Then, a global-local transformer is introduced to learn the interactions between the local-level and global-level semantics for better multi-modal reasoning. In addition, we develop a new cross-modal cycle-consistency loss to enforce interaction between the two modalities and encourage semantic alignment between them. Finally, we design a new cross-modal parallel transformer decoder that integrates the encoded visual and textual features for final grounding. Extensive experiments on three challenging datasets show that the proposed HLGT achieves new state-of-the-art performance.
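The abstract does not give the form of the cross-modal cycle-consistency loss, so the following is only a minimal sketch of one generic way such a loss can be built, not the HLGT formulation. It assumes clip and phrase features are plain lists of floats in a shared embedding space, and it uses a soft nearest-neighbour cycle (clip to phrases and back to clips). The names `soft_nearest` and `cycle_consistency_loss` are hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def soft_nearest(query_vec, keys):
    """Average of `keys`, weighted by softmax similarity to `query_vec`."""
    weights = softmax([dot(query_vec, k) for k in keys])
    dim = len(keys[0])
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]


def cycle_consistency_loss(clips, phrases):
    """Hypothetical loss: mean squared gap between each clip's own index
    and the soft index it lands on after a clip -> phrase -> clip cycle."""
    loss = 0.0
    for i, clip in enumerate(clips):
        # Forward step: represent the clip by its soft neighbour among phrases.
        phrase_proxy = soft_nearest(clip, phrases)
        # Backward step: distribution over clips given that phrase proxy.
        back = softmax([dot(phrase_proxy, c) for c in clips])
        soft_index = sum(j * p for j, p in enumerate(back))
        loss += (soft_index - i) ** 2
    return loss / len(clips)


if __name__ == "__main__":
    clips = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
    aligned_phrases = [row[:] for row in clips]        # each phrase matches one clip
    uninformative_phrases = [[1.0, 1.0, 1.0]] * 3      # phrases carry no clip-specific signal
    print(cycle_consistency_loss(clips, aligned_phrases))
    print(cycle_consistency_loss(clips, uninformative_phrases))
```

In this sketch the loss is close to zero when every clip cycles back to itself through the phrase space, and grows when the phrases do not distinguish between clips, which is the behaviour a cycle-consistency term is meant to reward and penalize respectively.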