Due to the rapid emergence of short videos and the corresponding demand for content understanding and creation, the video captioning task has received increasing attention in recent years. In this paper, we convert the traditional video captioning task into a new paradigm, \ie, Open-book Video Captioning, which generates natural language under the prompts of video-content-relevant sentences, not limited to the video itself. To address the open-book video captioning problem, we propose a novel Retrieve-Copy-Generate network, in which a pluggable video-to-text retriever is constructed to efficiently retrieve sentences as hints from the training corpus, and a copy-mechanism generator is introduced to dynamically extract expressions from the multiple retrieved sentences. The two modules can be trained end-to-end or separately, which is flexible and extensible. Our framework coordinates conventional retrieval-based methods with orthodox encoder-decoder methods, so that it can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate descriptions of the video content. Extensive experiments on several benchmark datasets show that our proposed approach surpasses state-of-the-art performance, indicating the effectiveness and promise of the proposed paradigm for video captioning.
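The abstract does not spell out the copy mechanism, so the following is a minimal illustrative sketch only, assuming a pointer-generator-style mixture in which the decoder's vocabulary distribution is blended with a copy distribution over tokens of the retrieved sentences. The function name \texttt{copy\_generate\_distribution} and all tensor shapes are hypothetical, not the paper's implementation.

\begin{verbatim}
import torch
import torch.nn.functional as F

def copy_generate_distribution(vocab_logits, copy_scores,
                               retrieved_token_ids, p_gen):
    """Mix a generation distribution with a copy distribution over
    tokens from retrieved sentences (pointer-generator-style sketch;
    an assumption, not the paper's exact formulation).

    vocab_logits:        (batch, vocab_size) decoder logits
    copy_scores:         (batch, n_retrieved_tokens) attention scores
                         over tokens of the retrieved sentences
    retrieved_token_ids: (batch, n_retrieved_tokens) vocabulary ids
                         of those retrieved tokens
    p_gen:               (batch, 1) probability of generating a word
                         from the vocabulary rather than copying one
    """
    p_vocab = F.softmax(vocab_logits, dim=-1)   # generation distribution
    p_copy = F.softmax(copy_scores, dim=-1)     # copy attention distribution
    # Scatter copy probabilities back into vocabulary space, summing
    # the mass of repeated tokens.
    copy_in_vocab = torch.zeros_like(p_vocab)
    copy_in_vocab.scatter_add_(1, retrieved_token_ids, p_copy)
    # Final word distribution: weighted mixture of generate and copy.
    return p_gen * p_vocab + (1.0 - p_gen) * copy_in_vocab

# Toy usage: batch of 2, vocabulary of 10, 4 retrieved tokens each.
vocab_logits = torch.randn(2, 10)
copy_scores = torch.randn(2, 4)
retrieved_ids = torch.randint(0, 10, (2, 4))
p_gen = torch.sigmoid(torch.randn(2, 1))
p_final = copy_generate_distribution(vocab_logits, copy_scores,
                                     retrieved_ids, p_gen)
assert torch.allclose(p_final.sum(dim=-1), torch.ones(2))
\end{verbatim}

Under this reading, the generator can emit a word either from its own vocabulary or verbatim from the hints supplied by the retriever, which is one concrete way to realize the "extract expressions from multiple retrieved sentences dynamically" behavior described above.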