In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming and demands intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (which we call 'prompts') into both the visual inputs and textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train the vision encoder and language encoder in a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity sparse 2D visual features. The proposed prompts also compensate for the lack of spatiotemporal information in 2D CNNs for visual feature extraction. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Last but not least, extensive experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5x inference acceleration over TVG using 3D visual features. Code and model will be released.
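To make the two ingredients summarized above concrete, the following is a minimal PyTorch sketch of (i) incorporating learnable prompts into the 2D visual inputs and textual features and (ii) a distance-IoU-style loss on temporal intervals. All class/function names and design details here (e.g., `VisualPadPrompt`, `TextPrompt`, `tdiou_loss`, the border-padding prompt layout) are illustrative assumptions for exposition, not the authors' released implementation; the exact prompt design and loss terms in the paper may differ.

```python
# Hypothetical sketch of text-visual prompting and a temporal DIoU-style loss.
# Not the authors' code; names and details are assumptions for illustration.
import torch
import torch.nn as nn


class VisualPadPrompt(nn.Module):
    """Learnable perturbation added to the border of each uniformly sampled 2D frame."""
    def __init__(self, frame_size: int = 224, pad: int = 16):
        super().__init__()
        self.pad = pad
        # One learnable strip per border; broadcast over batch and time dimensions.
        self.top = nn.Parameter(torch.zeros(3, pad, frame_size))
        self.bottom = nn.Parameter(torch.zeros(3, pad, frame_size))
        self.left = nn.Parameter(torch.zeros(3, frame_size - 2 * pad, pad))
        self.right = nn.Parameter(torch.zeros(3, frame_size - 2 * pad, pad))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W) sparse 2D frames sampled from the video.
        p = self.pad
        prompted = frames.clone()
        prompted[..., :p, :] += self.top
        prompted[..., -p:, :] += self.bottom
        prompted[..., p:-p, :p] += self.left
        prompted[..., p:-p, -p:] += self.right
        return prompted


class TextPrompt(nn.Module):
    """Learnable tokens prepended to the textual feature sequence."""
    def __init__(self, num_tokens: int = 10, dim: int = 768):
        super().__init__()
        self.tokens = nn.Parameter(torch.zeros(num_tokens, dim))

    def forward(self, text_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, seq_len, dim) features from the language encoder.
        b = text_feats.size(0)
        prompts = self.tokens.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([prompts, text_feats], dim=1)


def tdiou_loss(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """DIoU-style loss on 1D (start, end) intervals: 1 - IoU + normalized center-distance penalty."""
    # pred, gt: (batch, 2) with start < end, in normalized time.
    inter = (torch.min(pred[:, 1], gt[:, 1]) - torch.max(pred[:, 0], gt[:, 0])).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (gt[:, 1] - gt[:, 0]) - inter
    iou = inter / (union + eps)
    enclose = torch.max(pred[:, 1], gt[:, 1]) - torch.min(pred[:, 0], gt[:, 0])
    center_dist = ((pred[:, 0] + pred[:, 1]) - (gt[:, 0] + gt[:, 1])).abs() / 2
    return (1.0 - iou + (center_dist / (enclose + eps)) ** 2).mean()
```

Because both prompt modules are small sets of learnable parameters added to the model inputs, they can be co-trained end-to-end with the 2D vision encoder and the language encoder, which is the property the framework relies on for efficient cross-modal feature fusion.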