A Comprehensive Review of the Video-to-Text Problem

Research in Vision and Language encompasses challenging topics that seek to connect visual and textual information. The video-to-text problem is one of these topics; its goal is to connect an input video with its textual description. This connection is made mainly in two ways: by retrieving the most significant descriptions from a corpus, or by generating a new description given a context video. These two approaches correspond to essential tasks for the Computer Vision and Natural Language Processing communities, called the text retrieval from video task and the video captioning/description task. Both tasks are substantially more complex than predicting or retrieving a single sentence from an image. The spatiotemporal information present in videos introduces diversity and complexity in both the visual content and the structure of the associated language descriptions. This review categorizes and describes the state-of-the-art techniques for the video-to-text problem. It covers the main video-to-text methods and the ways to evaluate their performance. We analyze how the most frequently reported benchmark datasets were created, showing their drawbacks and strengths with respect to the problem requirements. We also show the impressive progress that researchers have made on each dataset, and we analyze why, despite this progress, video-to-text conversion remains unsolved: state-of-the-art techniques are still a long way from achieving human-like performance in generating or retrieving video descriptions. We cover several significant challenges in the field and discuss future research directions.
