Given an untrimmed video and a natural language query, video sentence grounding aims to localize the target temporal moment in the video. Existing methods mainly tackle this task by matching and aligning the semantics of the descriptive sentence and video segments at a single temporal resolution, neglecting the temporal consistency of video content across different resolutions. In this work, we propose a novel multi-resolution temporal video sentence grounding network, MRTNet, which consists of a multi-modal feature encoder, a Multi-Resolution Temporal (MRT) module, and a predictor module. The MRT module is an encoder-decoder network whose decoder features are combined with Transformers to predict the final start and end timestamps. Notably, the MRT module is hot-pluggable, meaning it can be seamlessly incorporated into any anchor-free model. In addition, we employ a hybrid loss to supervise the cross-modal features in the MRT module at three scales, frame-level, clip-level, and sequence-level, for more accurate grounding. Extensive experiments on three prevalent datasets demonstrate the effectiveness of MRTNet.
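To make the multi-resolution idea concrete, the sketch below shows one plausible way an encoder-decoder could produce temporal features at several resolutions: fused video-query features are repeatedly downsampled along the time axis and then upsampled back with skip connections, yielding one decoder output per resolution. This is a minimal, hypothetical illustration; the layer choices (1D convolutions, strides, dimensions) and the class name `MultiResolutionTemporalSketch` are our own assumptions, not the authors' implementation of the MRT module.

```python
import torch
import torch.nn as nn


class MultiResolutionTemporalSketch(nn.Module):
    """Hypothetical encoder-decoder over the temporal axis (illustration only)."""

    def __init__(self, dim: int = 256, num_levels: int = 3):
        super().__init__()
        # Encoder: each level halves the temporal length with a stride-2 1D conv.
        self.down = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1) for _ in range(num_levels)]
        )
        # Decoder: each level doubles the temporal length and fuses the
        # skip connection from the matching encoder level.
        self.up = nn.ModuleList(
            [nn.ConvTranspose1d(dim, dim, kernel_size=4, stride=2, padding=1) for _ in range(num_levels)]
        )

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        # x: (batch, dim, T) fused video-query features.
        skips = []
        for conv in self.down:
            skips.append(x)
            x = torch.relu(conv(x))
        outputs = []
        for deconv, skip in zip(self.up, reversed(skips)):
            x = torch.relu(deconv(x))
            x = x[..., : skip.size(-1)] + skip  # align lengths, add skip connection
            outputs.append(x)  # decoder features at this temporal resolution
        return outputs


if __name__ == "__main__":
    feats = torch.randn(2, 256, 64)  # a batch of 64-step fused features
    decoder_feats = MultiResolutionTemporalSketch()(feats)
    print([tuple(f.shape) for f in decoder_feats])  # coarse-to-fine resolutions
```

In this sketch, the list of decoder outputs would be what downstream Transformer-based predictors consume to regress start and end timestamps; how the hybrid frame-, clip-, and sequence-level supervision attaches to these features is not shown here.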