Although end-to-end (E2E) learning has led to promising performance on a variety of tasks, it is often impeded by hardware constraints (e.g., GPU memory) and is prone to overfitting. When it comes to video captioning, one of the most challenging benchmark tasks in computer vision and machine learning, these limitations of E2E learning are especially amplified by the fact that both the input videos and output captions are lengthy sequences. Indeed, state-of-the-art video captioning methods process video frames with convolutional neural networks and generate captions by unrolling recurrent neural networks. If we connect them in an E2E manner, the resulting model is both memory-consuming and data-hungry, making it extremely hard to train. In this paper, we propose a multitask reinforcement learning approach to training an E2E video captioning model. The main idea is to mine and construct as many effective tasks (e.g., attributes, rewards, and captions) as possible from the human-captioned videos so that they can jointly regulate the search space of the E2E neural network, from which an E2E video captioning model can be found that generalizes well to the testing phase. To the best of our knowledge, this is the first video captioning model that is trained end-to-end from the raw video input to the caption output. Experimental results show that such a model outperforms existing ones by a large margin on two benchmark video captioning datasets.
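To make the E2E setting concrete, below is a minimal PyTorch sketch of a CNN-encoder/RNN-decoder video captioner in which the caption loss back-propagates all the way into the frame encoder. The module names, layer sizes, and the choice of ResNet-18 and LSTMs are illustrative assumptions rather than the paper's exact configuration; the multitask terms the abstract mentions (attribute prediction and reinforcement-learning rewards) would be added on top of the cross-entropy objective shown here.

```python
# Minimal sketch (PyTorch) of an end-to-end CNN-encoder / LSTM-decoder video
# captioner. Architecture details are illustrative assumptions, not the
# paper's exact configuration.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class E2EVideoCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = resnet18(weights=None)          # frame encoder, trained jointly
        cnn.fc = nn.Identity()                # expose 512-d frame features
        self.cnn = cnn
        self.frame_rnn = nn.LSTM(512, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, T, 3, H, W) raw video; captions: (B, L) token ids
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).reshape(B, T, -1)  # (B, T, 512)
        _, (h, c) = self.frame_rnn(feats)       # video summary seeds the decoder
        dec_in = self.embed(captions[:, :-1])   # teacher forcing on ground truth
        dec_out, _ = self.decoder(dec_in, (h, c))
        return self.out(dec_out)                # (B, L-1, vocab) logits


if __name__ == "__main__":
    model = E2EVideoCaptioner(vocab_size=1000)
    frames = torch.randn(2, 8, 3, 224, 224)     # 2 clips, 8 frames each
    captions = torch.randint(0, 1000, (2, 12))
    logits = model(frames, captions)
    # Cross-entropy on next-token prediction; the backward pass updates the
    # CNN and both LSTMs jointly, which is the memory cost the abstract notes.
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, 1000), captions[:, 1:].reshape(-1))
    loss.backward()
    print(loss.item())
```

The sketch also shows why naive E2E training is memory-consuming and data-hungry: the activations of every sampled frame must be kept for the backward pass through both the recurrent decoder and the convolutional encoder, and all of those parameters must be fit from the captioned videos alone.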