Advances in technology have led to the development of methods that can create desired visual multimedia. In particular, image generation using deep learning has been extensively studied across diverse fields. In comparison, video generation, especially conditioned on auxiliary inputs, remains a challenging and less explored area. To narrow this gap, we aim to train our model to produce a video corresponding to a given text description. We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video. In the first phase, we focus on creating a high-quality single video frame while learning the relationship between the text and an image. As the steps proceed, our model is gradually trained on an increasing number of consecutive frames. This step-by-step learning process helps stabilize the training and enables the creation of high-resolution videos based on conditional text descriptions. Qualitative and quantitative experimental results on various datasets demonstrate the effectiveness of the proposed method.
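To make the step-by-step (frame-by-frame) training idea concrete, the following is a minimal Python sketch of such an evolutionary schedule: the model first handles a single text-conditioned frame and is then retrained on progressively more consecutive frames until the full video length is reached. The function names, the doubling schedule, and the target length of 16 frames are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a step-wise (frame-by-frame) training schedule.
# The names and the exact frame-count progression are assumptions for
# illustration only, not TiVGAN's released code.

def frame_schedule(target_frames: int) -> list:
    """Number of consecutive frames handled at each training phase,
    starting from a single text-conditioned image and growing until
    the full-length video is produced."""
    counts, n = [], 1
    while n < target_frames:
        counts.append(n)
        n = min(n * 2, target_frames)  # grow gradually (doubling here, as an assumption)
    counts.append(target_frames)
    return counts


def train_tivgan(text_batch, target_frames: int = 16):
    for num_frames in frame_schedule(target_frames):
        # Phase 1 (num_frames == 1): learn the text-to-image mapping.
        # Later phases: extend training to num_frames consecutive frames,
        # reusing the weights learned in the previous phase.
        print(f"training phase with {num_frames} consecutive frame(s)")
        # train_step(generator, discriminator, text_batch, num_frames)  # hypothetical helper


if __name__ == "__main__":
    train_tivgan(text_batch=None)
```

The design choice illustrated here is that each phase starts from a model already stable at a shorter clip length, which is what the abstract credits for stabilizing training and enabling high-resolution output.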