Temporal sentence grounding (TSG) aims to localize the temporal segment that is semantically aligned with a natural language query in an untrimmed video. Under the conventional TSG framework, most existing methods extract frame-grained or object-grained features with a 3D ConvNet or a detection network, failing to capture the subtle differences between frames or to model the spatio-temporal behavior of core persons/objects. In this paper, we introduce a new perspective on the TSG task: tracking pivotal objects and activities to learn more fine-grained spatio-temporal behaviors. Specifically, we propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator that generates multi-modal templates and a search space while filtering objects and activities, and (B) a Temporal Sentence Tracker that tracks the multi-modal targets to model their behavior and predict the query-related segment. Extensive experiments and comparisons with state-of-the-art methods are conducted on the challenging Charades-STA and TACoS benchmarks, where TSTNet achieves leading performance at considerable real-time speed.