Video-Text Retrieval (VTR) aims to search for the video most relevant to the semantics of a given sentence, and vice versa. In general, this retrieval task is composed of four successive steps: video feature extraction, textual feature extraction, feature embedding and matching, and objective functions. Finally, a list of samples retrieved from the dataset is ranked by their matching similarities to the query. In recent years, deep learning techniques have driven significant progress; however, VTR remains challenging due to problems such as how to learn efficient spatio-temporal video features and how to narrow the cross-modal gap. In this survey, we review and summarize over 100 research papers related to VTR, present state-of-the-art performance on several commonly used benchmark datasets, and discuss potential challenges and directions, with the expectation of providing insights for researchers in the field of video-text retrieval.
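To make the pipeline above concrete, the following is a minimal sketch of the matching and ranking stages in Python. The video and text encoders are assumed to have already been applied (they are stand-ins here, represented by random embeddings); the sketch only illustrates the common joint-embedding approach of computing cosine similarities in a shared space and ranking candidates by similarity to the query.

```python
import numpy as np

def cosine_similarity_matrix(video_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between L2-normalized video and text embeddings."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return v @ t.T  # shape: (num_videos, num_texts)

def rank_videos_for_query(sim_matrix: np.ndarray, query_idx: int) -> np.ndarray:
    """Return video indices sorted by descending similarity to the given text query."""
    return np.argsort(-sim_matrix[:, query_idx])

# Toy example: 4 videos and 3 text queries, each already mapped into a shared
# 8-dimensional space by hypothetical pretrained encoders (not shown here).
rng = np.random.default_rng(0)
video_embeddings = rng.normal(size=(4, 8))  # stand-in for a video encoder's output
text_embeddings = rng.normal(size=(3, 8))   # stand-in for a text encoder's output

sims = cosine_similarity_matrix(video_embeddings, text_embeddings)
print("Ranked videos for text query 0:", rank_videos_for_query(sims, 0))
```

Text-to-video retrieval then returns the top-ranked videos for a sentence; video-to-text retrieval is the symmetric operation over rows of the same similarity matrix.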