Temporal sentence grounding in videos (TSGV), also known as natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve, from an untrimmed video, a temporal moment that semantically corresponds to a language query. Connecting computer vision and natural language, TSGV has drawn significant attention from researchers in both communities. This survey summarizes the fundamental concepts of TSGV, the current research status, and future research directions. As background, we present, in a tutorial style, the common structure of functional components in TSGV: from feature extraction of the raw video and language query to answer prediction of the target moment. We then review the techniques for multimodal understanding and interaction, the key focus of TSGV for achieving effective alignment between the two modalities. We construct a taxonomy of TSGV techniques and elaborate on the methods in each category, together with their strengths and weaknesses. Lastly, we discuss issues with current TSGV research and share our insights on promising research directions.
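To make the component structure mentioned above concrete, the following is a minimal sketch of a generic TSGV pipeline in PyTorch, assuming a span-based (start/end boundary) formulation: pre-extracted video and query features are projected into a shared space, fused by cross-attention as the multimodal interaction step, and mapped to per-clip boundary scores as the answer prediction step. All names and dimensions (TSGVModel, video_proj, cross_attn, span_head, feature sizes) are illustrative assumptions, not the specific methods reviewed in this survey.

    # Minimal sketch of the common TSGV component structure (assumed span-based formulation).
    import torch
    import torch.nn as nn

    class TSGVModel(nn.Module):
        def __init__(self, video_dim=1024, query_dim=300, hidden_dim=256):
            super().__init__()
            # Feature extraction: project pre-extracted video/query features
            # (e.g., 3D-CNN clip features and word embeddings) into a shared space.
            self.video_proj = nn.Linear(video_dim, hidden_dim)
            self.query_proj = nn.Linear(query_dim, hidden_dim)
            # Multimodal interaction: cross-attention from video clips to query words.
            self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
            # Answer prediction: per-clip scores for the start and end boundaries.
            self.span_head = nn.Linear(hidden_dim, 2)

        def forward(self, video_feats, query_feats):
            v = self.video_proj(video_feats)       # (B, T, H)
            q = self.query_proj(query_feats)       # (B, L, H)
            fused, _ = self.cross_attn(v, q, q)    # query-aware video representation
            logits = self.span_head(fused)         # (B, T, 2)
            start_logits, end_logits = logits.unbind(dim=-1)
            return start_logits, end_logits

    if __name__ == "__main__":
        model = TSGVModel()
        video = torch.randn(2, 64, 1024)   # 64 clip-level features per video
        query = torch.randn(2, 12, 300)    # 12 word embeddings per query
        start, end = model(video, query)
        print(start.shape, end.shape)      # torch.Size([2, 64]) torch.Size([2, 64])

Under this assumed formulation, the predicted moment is obtained by selecting the start and end positions with the highest scores; proposal-based and other prediction heads discussed later in the survey replace only the final component of this structure.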