We propose to perform self-supervised disentanglement of depth and camera pose from large-scale videos. We introduce an Autoencoder-based method that reconstructs the input video frames for training, without using any ground-truth annotations of depth or camera pose. The model's encoders estimate the monocular depth and the camera pose. The decoder then constructs a Multiplane NeRF representation from the depth encoder features and renders the input frames with the estimated camera poses. The learning is supervised by the reconstruction error, based on the assumption that the scene structure does not change over short time spans in videos. Once trained, the model can be applied to multiple tasks, including depth estimation, camera pose estimation, and single-image novel view synthesis. We show substantial improvements over previous self-supervised approaches on all tasks, and in some applications even better results than counterparts trained with ground-truth camera poses. Our code will be made publicly available. Our project page is: https://oasisyang.github.io/self-mpinerf .
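Below is a minimal sketch of the training loop the abstract describes, written in PyTorch. All module names (DepthEncoder, PoseEncoder, MPINeRFDecoder), architectures, and tensor shapes are illustrative assumptions, not the authors' implementation, and the decoder here is only a placeholder with the same interface as a multiplane NeRF renderer: one frame is encoded into a scene feature, a relative pose is estimated from a frame pair, the second frame is rendered, and the photometric reconstruction error supervises all three modules.

```python
# Hedged sketch of the self-supervised pipeline; names and shapes are
# illustrative assumptions, not the paper's actual code.
import torch
import torch.nn as nn

class DepthEncoder(nn.Module):
    """Toy stand-in: maps one frame to a per-pixel scene/depth feature."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)

    def forward(self, frame):            # frame: (B, 3, H, W)
        return self.net(frame)           # feature: (B, feat_dim, H, W)

class PoseEncoder(nn.Module):
    """Toy stand-in: predicts a 6-DoF relative pose between two frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(6, 6))

    def forward(self, src, tgt):         # each: (B, 3, H, W)
        x = torch.cat([src, tgt], dim=1) # (B, 6, H, W)
        return self.net(x)               # (B, 6) axis-angle + translation

class MPINeRFDecoder(nn.Module):
    """Placeholder renderer: combines the scene feature with the pose to
    produce an RGB image (a real multiplane NeRF would render by compositing
    fronto-parallel planes along rays of the estimated camera)."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.to_rgb = nn.Conv2d(feat_dim + 6, 3, kernel_size=3, padding=1)

    def forward(self, scene_feat, pose):
        B, _, H, W = scene_feat.shape
        pose_map = pose.view(B, 6, 1, 1).expand(B, 6, H, W)
        return self.to_rgb(torch.cat([scene_feat, pose_map], dim=1))

# One training step: reconstruct the target frame from the source frame's
# representation and the estimated relative pose; the reconstruction error
# is the only supervision (no depth or camera ground truth).
depth_enc, pose_enc, decoder = DepthEncoder(), PoseEncoder(), MPINeRFDecoder()
params = list(depth_enc.parameters()) + list(pose_enc.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

src_frame = torch.rand(2, 3, 64, 64)   # dummy video frame at time t
tgt_frame = torch.rand(2, 3, 64, 64)   # dummy video frame at time t+1

scene_feat = depth_enc(src_frame)
rel_pose = pose_enc(src_frame, tgt_frame)
rendered = decoder(scene_feat, rel_pose)
loss = nn.functional.l1_loss(rendered, tgt_frame)
loss.backward()
optimizer.step()
```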