Although end-to-end (E2E) learning has led to impressive progress on a variety of visual understanding tasks, it is often impeded by hardware constraints (e.g., GPU memory) and is prone to overfitting. In video captioning, one of the most challenging benchmark tasks in computer vision, these limitations of E2E learning are especially amplified because both the input videos and the output captions are lengthy sequences. Indeed, state-of-the-art video captioning methods process video frames with convolutional neural networks and generate captions by unrolling recurrent neural networks. If we connect them in an E2E manner, the resulting model is both memory-consuming and data-hungry, making it extremely hard to train. In this paper, we propose a multitask reinforcement learning approach to training an E2E video captioning model. The main idea is to mine and construct as many effective tasks (e.g., attributes, rewards, and the captions) as possible from the human-captioned videos so that they can jointly regulate the search space of the E2E neural network, from which an E2E video captioning model can be found that generalizes to the testing phase. To the best of our knowledge, this is the first video captioning model that is trained end-to-end from the raw video input to the caption output. Experimental results show that such a model outperforms existing ones by a large margin on two benchmark video captioning datasets.
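To make the architecture described above concrete, the following is a minimal sketch (not the authors' implementation) of an E2E video captioning model: a CNN encodes raw frames and an LSTM decoder, unrolled over the caption tokens, is trained jointly with it. The ResNet-18 backbone, hidden sizes, frame resolution, and mean-pooling over frames are illustrative assumptions; the paper's multitask reinforcement learning objective (attribute prediction and caption-level rewards) would be layered on top of this supervised skeleton.

```python
# Minimal sketch of an end-to-end (CNN encoder + unrolled LSTM decoder)
# video captioning model. All hyperparameters here are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class E2EVideoCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN frame encoder (ResNet-18 backbone with the classifier head removed).
        backbone = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # (B*T, 512, 1, 1)
        self.frame_proj = nn.Linear(512, hidden_dim)
        # LSTM caption decoder, unrolled over the output tokens.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, T, 3, H, W) raw video frames; captions: (B, L) token ids.
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)           # (B*T, 512)
        feats = self.frame_proj(feats).view(B, T, -1).mean(dim=1)   # (B, hidden)
        # Initialize the decoder state from the pooled video feature.
        h0 = feats.unsqueeze(0)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)                                   # (B, L, embed)
        hidden, _ = self.decoder(emb, (h0, c0))                      # (B, L, hidden)
        return self.out(hidden)                                      # (B, L, vocab)


if __name__ == "__main__":
    model = E2EVideoCaptioner(vocab_size=1000)
    frames = torch.randn(2, 8, 3, 112, 112)     # 2 clips, 8 frames each
    captions = torch.randint(0, 1000, (2, 12))  # 2 captions, 12 tokens each
    logits = model(frames, captions)
    print(logits.shape)                         # torch.Size([2, 12, 1000])
```

Because the gradients flow from the caption logits back through the CNN over every sampled frame, the memory and data demands noted in the abstract grow with both the clip length T and the caption length L, which is precisely what motivates the proposed multitask regularization.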