A video autoencoder is proposed for learning disentangled representations of 3D structure and camera pose from videos in a self-supervised manner. Relying on temporal continuity in videos, our work assumes that the 3D scene structure in nearby video frames remains static. Given a sequence of video frames as input, the video autoencoder extracts a disentangled representation of the scene including: (i) a temporally-consistent deep voxel feature to represent the 3D structure and (ii) a 3D trajectory of camera pose for each frame. These two representations will then be re-entangled for rendering the input video frames. This video autoencoder can be trained directly using a pixel reconstruction loss, without any ground truth 3D or camera pose annotations. The disentangled representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following. We evaluate our method on several large-scale natural video datasets, and show generalization results on out-of-domain images.
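As a concrete reading of the pipeline described above, the following is a minimal sketch of the self-supervised training objective. The notation (encoder E, decoder D, shared voxel feature z, per-frame pose p_t, input frame I_t) and the choice of an L1 photometric term are assumptions introduced here for illustration, not taken from the paper; the actual objective may contain additional terms.

```latex
% Minimal sketch of the self-supervised objective (notation assumed, not the authors').
% E: encoder, D: decoder/renderer, z: shared 3D voxel feature,
% p_t: per-frame camera pose, I_t: input frame, T: clip length.
\begin{align}
  \bigl(z,\ \{p_t\}_{t=1}^{T}\bigr) &= E\bigl(\{I_t\}_{t=1}^{T}\bigr)
    && \text{disentangle static structure and camera pose} \\
  \hat{I}_t &= D\bigl(z,\ p_t\bigr)
    && \text{re-entangle and render frame } t \\
  \mathcal{L}_{\mathrm{recon}} &= \sum_{t=1}^{T} \bigl\lVert \hat{I}_t - I_t \bigr\rVert_1
    && \text{pixel reconstruction loss (no 3D or pose labels)}
\end{align}
```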