We introduce a novel approach that takes a single semantic mask as input and synthesizes multi-view consistent color images of natural scenes, trained on a collection of single images from the Internet. Prior works on 3D-aware image synthesis either require multi-view supervision or learn category-level priors for specific classes of objects, neither of which works well for natural scenes. Our key idea for solving this challenging problem is to use a semantic field as the intermediate representation, which is easier to reconstruct from an input semantic mask and can then be translated into a radiance field with the assistance of off-the-shelf semantic image synthesis models. Experiments show that our method outperforms baseline methods and produces photorealistic, multi-view consistent videos of a variety of natural scenes.