In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming and demanding in memory and compute. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (which we call 'prompts') into both the visual inputs and the textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train the vision encoder and language encoder in a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., a 9.79% improvement on Charades-STA and a 30.77% improvement on ActivityNet Captions) and achieves 5x inference acceleration over TVG using 3D visual features. Code is available at Open.Intel.
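To make the temporal-IoU-plus-distance idea behind the TDIoU loss concrete, the sketch below adapts a DIoU-style penalty to 1D time intervals: the loss combines (1 - temporal IoU) with a center-distance term normalized by the smallest enclosing interval. The function name `tdiou_loss`, the specific normalization, and the weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def tdiou_loss(pred: torch.Tensor, target: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """DIoU-style loss for 1D temporal intervals (a sketch, not the paper's exact TDIoU).

    pred, target: tensors of shape (N, 2) holding (start, end) times normalized to [0, 1].
    Returns the mean loss over the batch.
    """
    p_s, p_e = pred[:, 0], pred[:, 1]
    t_s, t_e = target[:, 0], target[:, 1]

    # Temporal intersection over union of the predicted and ground-truth intervals.
    inter = (torch.min(p_e, t_e) - torch.max(p_s, t_s)).clamp(min=0)
    union = (p_e - p_s) + (t_e - t_s) - inter
    iou = inter / union.clamp(min=1e-6)

    # Distance penalty: squared gap between interval centers,
    # normalized by the length of the smallest enclosing interval.
    enclose = (torch.max(p_e, t_e) - torch.min(p_s, t_s)).clamp(min=1e-6)
    center_dist = ((p_s + p_e) / 2 - (t_s + t_e) / 2) ** 2 / enclose ** 2

    return (1.0 - iou + alpha * center_dist).mean()


# Example usage with dummy normalized (start, end) predictions and targets.
pred = torch.tensor([[0.10, 0.45], [0.30, 0.80]])
target = torch.tensor([[0.15, 0.50], [0.25, 0.70]])
print(tdiou_loss(pred, target))
```

Compared with a plain temporal IoU loss, the distance term still provides a gradient when the predicted and ground-truth intervals do not overlap, which is one motivation for DIoU-style objectives in localization tasks.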