Building correspondences across different modalities, such as video and language, has recently become critical in many visual recognition applications, for instance video captioning. Inspired by machine translation, recent models tackle this task using an encoder-decoder strategy. The (video) encoder is traditionally a Convolutional Neural Network (CNN), while the decoding (for language generation) is done using a Recurrent Neural Network (RNN). Current state-of-the-art methods, however, train encoder and decoder separately. CNNs are pretrained on object and/or action recognition tasks and used to encode video-level features; the decoder is then optimised on such static features to generate the video's description. This disjoint setup is arguably sub-optimal for the input (video) to output (description) mapping. In this work, we propose to optimise both encoder and decoder simultaneously, in an end-to-end fashion. In a two-stage training setting, we first initialise our architecture using pre-trained encoders and decoders; the entire network is then trained end-to-end in a fine-tuning stage to learn the most relevant features for video caption generation. In our experiments, we use GoogLeNet and Inception-ResNet-v2 as encoders and an original Soft-Attention (SA-) LSTM as a decoder. Analogously to gains observed in other computer vision problems, we show that end-to-end training significantly improves over the traditional, disjoint training process. We evaluate our End-to-End (EtENet) Networks on the Microsoft Research Video Description (MSVD) and MSR Video to Text (MSR-VTT) benchmark datasets, showing that EtENet achieves state-of-the-art performance across the board.
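To make the two-stage scheme above concrete, here is a minimal PyTorch-style sketch, not the actual EtENet implementation: a small stand-in CNN encoder (in place of GoogLeNet / Inception-ResNet-v2) feeds per-frame features to a soft-attention LSTM decoder; the decoder is first trained with the encoder frozen (the disjoint baseline), then both are fine-tuned jointly so that caption-loss gradients reach the encoder. All names (FrameEncoder, SoftAttnDecoder), dimensions, and learning rates are hypothetical, and details such as target shifting and <BOS>/<EOS> handling are omitted.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Stand-in for a pretrained 2D CNN (e.g. GoogLeNet / Inception-ResNet-v2)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, frames):                           # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.conv(frames.flatten(0, 1)).flatten(1)   # (B*T, 64)
        return self.proj(x).view(b, t, -1)               # (B, T, feat_dim)

class SoftAttnDecoder(nn.Module):
    """Soft-attention LSTM decoder: attends over frame features at each step."""
    def __init__(self, feat_dim=512, hid_dim=512, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, hid_dim)
        self.attn = nn.Linear(feat_dim + hid_dim, 1)
        self.lstm = nn.LSTMCell(feat_dim + hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab)

    def forward(self, feats, captions):         # feats: (B, T, F), captions: (B, L)
        b, t, _ = feats.shape
        h = feats.new_zeros(b, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for step in range(captions.size(1)):
            # soft attention: score each frame feature against the hidden state
            scores = self.attn(torch.cat(
                [feats, h.unsqueeze(1).expand(-1, t, -1)], dim=-1))   # (B, T, 1)
            ctx = (scores.softmax(dim=1) * feats).sum(dim=1)          # (B, F)
            h, c = self.lstm(
                torch.cat([ctx, self.embed(captions[:, step])], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                             # (B, L, vocab)

enc, dec = FrameEncoder(), SoftAttnDecoder()
frames = torch.randn(2, 8, 3, 64, 64)              # toy batch: 2 clips, 8 frames each
caps = torch.randint(0, 1000, (2, 12))             # toy caption token ids
loss_fn = nn.CrossEntropyLoss()

# Stage 1 (disjoint baseline): encoder frozen, decoder trained on static features.
for p in enc.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(dec.parameters(), lr=1e-3)
loss = loss_fn(dec(enc(frames), caps).flatten(0, 1), caps.flatten())
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2 (end-to-end fine-tuning): unfreeze the encoder, train both jointly.
for p in enc.parameters():
    p.requires_grad_(True)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-5)
loss = loss_fn(dec(enc(frames), caps).flatten(0, 1), caps.flatten())
opt.zero_grad(); loss.backward(); opt.step()
```

The key design point illustrated is the second stage: once the encoder's parameters are unfrozen, the captioning loss back-propagates through the attention decoder into the CNN, so the frame features themselves adapt to the captioning task rather than staying fixed at their recognition-pretrained values.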