We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model. The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment. MAV3D does not require any 3D or 4D data and the T2V model is trained only on Text-Image pairs and unlabeled videos. We demonstrate the effectiveness of our approach using comprehensive quantitative and qualitative experiments and show an improvement over previously established internal baselines. To the best of our knowledge, our method is the first to generate 3D dynamic scenes given a text description.
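To make the core idea concrete, below is a minimal sketch (not the authors' code) of how a 4D dynamic NeRF could be optimized by querying a frozen text-to-video diffusion model with a score-distillation-style objective. The classes `Dynamic4DNeRF` and `T2VDiffusion`, the toy noise schedule, and all tensor shapes are hypothetical stand-ins, stubbed only so the example runs end to end; the actual components in MAV3D are far more involved.

```python
import torch
import torch.nn as nn

class Dynamic4DNeRF(nn.Module):
    """Hypothetical dynamic NeRF: maps a camera pose to a short RGB video clip."""
    def __init__(self, frames=8, size=32):
        super().__init__()
        # A learnable video tensor stands in for a real space-time radiance field
        # plus volumetric renderer.
        self.scene = nn.Parameter(torch.rand(1, 3, frames, size, size))

    def render(self, camera):
        # A real renderer would ray-march through the 4D field for this viewpoint.
        return torch.sigmoid(self.scene + 0.0 * camera.sum())

class T2VDiffusion(nn.Module):
    """Hypothetical frozen T2V diffusion model exposing a noise predictor."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv3d(3, 3, kernel_size=3, padding=1)
        for p in self.parameters():
            p.requires_grad_(False)

    def predict_noise(self, noisy_video, t, text_embedding):
        # A real model would condition on the diffusion timestep and text embedding.
        return self.net(noisy_video)

def sds_step(nerf, t2v, optimizer, text_embedding):
    """One score-distillation-style update: the T2V model scores a rendered video."""
    camera = torch.randn(3)                    # random viewpoint
    video = nerf.render(camera)                # (1, 3, T, H, W) in [0, 1]

    t = torch.randint(1, 1000, (1,))           # random diffusion timestep
    alpha = 1.0 - t.item() / 1000.0            # toy noise schedule
    noise = torch.randn_like(video)
    noisy = alpha ** 0.5 * video + (1.0 - alpha) ** 0.5 * noise

    with torch.no_grad():
        pred = t2v.predict_noise(noisy, t, text_embedding)

    # SDS-style gradient: nudge the rendering toward videos the T2V model prefers,
    # skipping the diffusion model's Jacobian (gradient flows only through `video`).
    grad = (pred - noise).detach()
    loss = (grad * video).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    nerf, t2v = Dynamic4DNeRF(), T2VDiffusion()
    opt = torch.optim.Adam(nerf.parameters(), lr=1e-2)
    text_emb = torch.randn(1, 77, 512)         # placeholder text embedding
    for step in range(10):
        sds_step(nerf, t2v, opt, text_emb)
```

Because only the rendered video requires gradients, the diffusion model stays frozen throughout, which is consistent with the abstract's claim that no 3D or 4D data is needed: supervision comes entirely from the pretrained T2V model and the text prompt.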