We present Phenaki, a model capable of realistic video synthesis given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, the limited quantity of high-quality text-video data, and the variable length of videos. To address these issues, we introduce a new model for learning video representations that compresses the video to a small representation of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text, we use a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to previous video generation methods, Phenaki can generate arbitrarily long videos conditioned on a sequence of prompts (i.e., time-variable text, or a story) in the open domain. To the best of our knowledge, this is the first time a paper studies generating videos from time-variable prompts. In addition, compared to per-frame baselines, the proposed video encoder-decoder computes fewer tokens per video but results in better spatio-temporal consistency.
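To illustrate the "causal attention in time" idea mentioned above, the following is a minimal, hypothetical sketch (not the paper's actual C-ViViT implementation): it builds a block-causal attention mask in which the tokens of a frame can attend to all tokens of the same and earlier frames, but never to future frames. The function name `block_causal_time_mask` and the toy dimensions are illustrative assumptions.

```python
import torch

def block_causal_time_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask: a token in frame t may attend to any token in frames <= t
    (block-causal in time, full attention within each frame)."""
    # Frame index of every token position, e.g. [0,0,...,1,1,...] for flattened frames
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # mask[i, j] is True where attention from position i to position j is allowed
    return frame_idx[:, None] >= frame_idx[None, :]

# Toy example: 4 frames, each compressed to 6 spatial tokens -> 24 x 24 mask
mask = block_causal_time_mask(num_frames=4, tokens_per_frame=6)
scores = torch.randn(24, 24)                       # toy attention logits
scores = scores.masked_fill(~mask, float("-inf"))  # forbid attention to future frames
attn = scores.softmax(dim=-1)
```

Because the mask depends only on each token's frame index, the same tokenizer can be applied to videos of any length and extended one frame at a time, which is what enables variable-length and arbitrarily long generation.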