Deep Convolutional Neural Networks (CNNs) are powerful models that have achieved excellent performance on difficult computer vision tasks. Although CNNs perform well when large labeled training sets are available, they struggle with video frame synthesis because objects deform and move, scene lighting changes, and the camera moves across the video sequence. In this paper, we present a novel and general end-to-end architecture, called the convolutional Transformer (ConvTransformer), for video frame sequence learning and video frame synthesis. The core ingredient of ConvTransformer is the proposed attention layer, a multi-head convolutional self-attention layer, which learns the sequential dependence of the video sequence. ConvTransformer uses an encoder, built upon multi-head convolutional self-attention layers, to encode the sequential dependence between the input frames, and a decoder then decodes the long-term dependence between the target synthesized frames and the input frames. Experiments on the video future frame extrapolation task show ConvTransformer to be superior in quality while being more parallelizable than recent approaches built upon convolutional LSTMs (ConvLSTM). To the best of our knowledge, this is the first time that a ConvTransformer architecture has been proposed and applied to video frame synthesis.
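The following is a minimal sketch of a multi-head convolutional self-attention layer of the kind described above, assuming PyTorch. The tensor layout, kernel size, and the per-pixel dot-product scoring over the temporal axis are illustrative assumptions, not the paper's exact formulation: queries, keys, and values are produced by 2-D convolutions over each frame's feature map, and attention is taken across frames at every spatial site.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadConvSelfAttention(nn.Module):
    """Illustrative multi-head convolutional self-attention over a frame sequence."""

    def __init__(self, channels: int, heads: int = 4, kernel_size: int = 3):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        self.head_dim = channels // heads
        pad = kernel_size // 2
        # Convolutions replace the linear Q/K/V projections of the standard
        # Transformer so that spatial structure is preserved.
        self.to_q = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.to_k = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.to_v = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) -- a sequence of T frame feature maps.
        b, t, c, h, w = x.shape
        flat = x.reshape(b * t, c, h, w)
        q = self.to_q(flat).reshape(b, t, self.heads, self.head_dim, h, w)
        k = self.to_k(flat).reshape(b, t, self.heads, self.head_dim, h, w)
        v = self.to_v(flat).reshape(b, t, self.heads, self.head_dim, h, w)
        # Attention scores between every pair of frames at each spatial
        # site: (B, heads, H, W, T_query, T_key).
        scores = torch.einsum('bqedhw,bkedhw->behwqk', q, k) / self.head_dim ** 0.5
        attn = F.softmax(scores, dim=-1)  # normalize over key frames
        out = torch.einsum('behwqk,bkedhw->bqedhw', attn, v)
        out = out.reshape(b * t, c, h, w)
        return self.proj(out).reshape(b, t, c, h, w)

# Usage: attend over an 8-frame sequence of 64-channel feature maps.
layer = MultiHeadConvSelfAttention(channels=64, heads=4)
frames = torch.randn(2, 8, 64, 32, 32)
print(layer(frames).shape)  # torch.Size([2, 8, 64, 32, 32])
```

Because every query frame attends to all key frames in a single batched operation, the whole sequence is processed in parallel, in contrast to the step-by-step recurrence of ConvLSTM-based approaches.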