3D photography turns a static image into a video with appealing 3D visual effects. Existing approaches typically first conduct monocular depth estimation, then render the input frame to novel viewpoints for the subsequent frames, and finally use an inpainting model to fill the missing/occluded regions. The inpainting model plays a crucial role in rendering quality, but it is normally trained on out-of-domain data. To reduce the gap between training and inference, we propose a novel self-supervised diffusion model as the inpainting module. Given a single input image, we automatically construct a training pair, consisting of a masked occluded image and the ground-truth image, via random cycle-rendering. The constructed training samples are closely aligned with the testing instances, without the need for data annotation. To make full use of the masked images, we design a Masked Enhanced Block (MEB), which can be easily plugged into the UNet to enhance the semantic conditions. Towards real-world animation, we present a novel task: out-animation, which extends the space and time of input objects. Extensive experiments on real datasets show that our method achieves results competitive with existing SOTA methods. A minimal sketch of the cycle-rendering pair construction follows.
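The sketch below illustrates one plausible way to build the self-supervised training pair described above: warp the image (with its estimated depth) to a random viewpoint and back, and treat the pixels lost to occlusion as the inpainting mask. It is an assumption-laden illustration, not the paper's released code; `forward_warp`, `random_pose`, and `invert_pose` are hypothetical helpers.

```python
import torch

def make_cycle_rendering_pair(image, depth, forward_warp, random_pose, invert_pose):
    """Construct a self-supervised training pair via random cycle-rendering.

    Hypothetical helper contracts (assumptions, not the paper's API):
      forward_warp(image, depth, pose) -> (warped_image, warped_depth, valid_mask)
      random_pose()                    -> a randomly sampled camera pose
      invert_pose(pose)                -> the inverse camera pose
    `image` and `depth` are torch tensors; `valid_mask` is 1 where pixels
    are visible after warping and 0 where they are missing/occluded.
    """
    # Render the input image to a randomly sampled novel viewpoint.
    pose = random_pose()
    novel_view, novel_depth, mask_fwd = forward_warp(image, depth, pose)

    # Render back to the original viewpoint; regions occluded in either
    # direction remain missing, mimicking the holes seen at test time.
    _, _, mask_bwd = forward_warp(novel_view, novel_depth, invert_pose(pose))
    occlusion_mask = (mask_fwd * mask_bwd).clamp(0, 1)

    # The masked occluded image is the diffusion inpainter's input;
    # the original image serves as its ground truth.
    masked_image = image * occlusion_mask
    return masked_image, occlusion_mask, image
```

Because the pair is derived from the test-time rendering process itself, the training distribution of masks and content stays close to what the inpainting module sees at inference, which is the motivation stated in the abstract.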