A Comprehensive Review of the Video-to-Text Problem

Research in the Vision and Language area encompasses challenging topics that seek to connect visual and textual information. When the visual information comes from videos, this leads to Video-Text Research, which includes several challenging tasks such as video question answering, video summarization with natural language, and video-to-text and text-to-video conversion. This paper reviews the video-to-text problem, in which the goal is to associate an input video with its textual description. This association is made mainly in one of two ways: by retrieving the most relevant descriptions from a corpus, or by generating a new description given a context video. These two approaches correspond to essential tasks for the Computer Vision and Natural Language Processing communities, known as text retrieval from video and video captioning (or video description). Both tasks are substantially more complex than predicting or retrieving a single sentence from an image, because the spatiotemporal information present in videos introduces diversity and complexity in both the visual content and the structure of the associated language descriptions. This review categorizes and describes the state-of-the-art techniques for the video-to-text problem. It covers the main video-to-text methods and the ways to evaluate their performance. We analyze twenty-six benchmark datasets, highlighting their strengths and drawbacks with respect to the problem's requirements. We also show the progress that researchers have made on each dataset, cover the challenges in the field, and discuss future research directions.
