This paper proposes a novel model for video generation, specifically addressing the problem of video generation from text descriptions, i.e., synthesizing realistic videos conditioned on given texts. Existing video generation methods cannot be easily adapted to this task, due to the frame-discontinuity issue and their text-free generation schemes. To address these problems, we propose a recurrent deconvolutional generative adversarial network (RD-GAN), which comprises a recurrent deconvolutional network (RDN) as the generator and a 3D convolutional neural network (3D-CNN) as the discriminator. The RDN is a deconvolutional version of the conventional recurrent neural network; it models the long-range temporal dependency among generated video frames and makes good use of the conditional information. The proposed model can be trained jointly by pushing the RDN to generate videos realistic enough that the 3D-CNN cannot distinguish them from real ones. We apply the proposed RD-GAN to a series of tasks, including conventional video generation, conditional video generation, video prediction, and video classification, and demonstrate its effectiveness by achieving strong performance.
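To make the architecture concrete, below is a minimal PyTorch sketch of the RD-GAN idea described above: an LSTM-driven deconvolutional generator standing in for the RDN, paired with a 3D-CNN discriminator. All layer sizes, module names, and the training objective shown here are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal RD-GAN-style sketch (assumed sizes: 16 frames of 64x64 RGB).
import torch
import torch.nn as nn

class RDNGenerator(nn.Module):
    """Recurrent deconvolutional generator: an LSTM unrolls a per-frame latent
    conditioned on the text embedding; deconvolutions decode each latent into
    a frame, so consecutive frames share temporal state."""
    def __init__(self, text_dim=128, noise_dim=100, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(text_dim + noise_dim, hidden_dim, batch_first=True)
        self.decode = nn.Sequential(  # hidden state -> 64x64 RGB frame
            nn.ConvTranspose2d(hidden_dim, 128, 4, 1, 0), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, text_emb, noise, num_frames=16):
        # Repeat the conditioning vector over time so every step sees the text.
        cond = torch.cat([text_emb, noise], dim=1)
        seq = cond.unsqueeze(1).expand(-1, num_frames, -1)
        h, _ = self.rnn(seq)                        # (B, T, hidden_dim)
        b, t, d = h.shape
        frames = self.decode(h.reshape(b * t, d, 1, 1))
        return frames.view(b, t, 3, 64, 64)         # (B, T, C, H, W)

class VideoDiscriminator(nn.Module):
    """3D-CNN discriminator scoring whole clips as real or fake."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv3d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv3d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(128, 1),
        )

    def forward(self, video):                        # video: (B, T, C, H, W)
        return self.net(video.permute(0, 2, 1, 3, 4))  # Conv3d expects (B, C, T, H, W)

# Smoke test: one generator step of the adversarial objective on random tensors.
G, D = RDNGenerator(), VideoDiscriminator()
text, z = torch.randn(2, 128), torch.randn(2, 100)
fake = G(text, z)
loss = nn.functional.binary_cross_entropy_with_logits(
    D(fake), torch.ones(2, 1))                       # generator tries to fool D
loss.backward()
print(fake.shape)                                    # torch.Size([2, 16, 3, 64, 64])
```

Because the recurrence runs in latent space and a single deconvolutional decoder renders every time step, consecutive frames share hidden state, which is one plausible way to obtain the frame continuity the abstract emphasizes.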