Video Autoencoder: self-supervised disentanglement of static 3D structure and motion

A video autoencoder is proposed for learning disentangled representations of 3D structure and camera pose from videos in a self-supervised manner. Relying on temporal continuity in videos, our work assumes that the 3D scene structure in nearby video frames remains static. Given a sequence of video frames as input, the video autoencoder extracts a disentangled representation of the scene including: (i) a temporally-consistent deep voxel feature to represent the 3D structure and (ii) a 3D trajectory of camera pose for each frame. These two representations will then be re-entangled for rendering the input video frames. This video autoencoder can be trained directly using a pixel reconstruction loss, without any ground truth 3D or camera pose annotations. The disentangled representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following. We evaluate our method on several large-scale natural video datasets, and show generalization results on out-of-domain images.
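The split into static structure and per-frame pose, followed by re-entangling and a pixel reconstruction loss, can be illustrated with a deliberately tiny analog. The sketch below is not the paper's model: it assumes a 1-D "scene" (a list of intensities) in place of a deep voxel feature, an integer window offset in place of a 3D camera pose, and a brute-force search in place of a learned pose encoder. The scene is also given rather than learned. All names (`render`, `reconstruction_loss`, `estimate_pose`) are hypothetical.

```python
# Toy 1-D analog (an assumption for illustration, not the paper's model):
# the "scene" is a static list of intensities and the "camera pose" is an
# integer offset selecting which window of the scene is visible.

def render(scene, pose, width):
    """Re-entangle static structure and pose into one frame."""
    return scene[pose:pose + width]

def reconstruction_loss(frame, rendered):
    """Mean squared pixel error between an input frame and its rendering."""
    return sum((a - b) ** 2 for a, b in zip(frame, rendered)) / len(frame)

def estimate_pose(scene, frame):
    """Pick the pose whose rendering best reconstructs the frame."""
    width = len(frame)
    candidates = range(len(scene) - width + 1)
    return min(
        candidates,
        key=lambda p: reconstruction_loss(frame, render(scene, p, width)),
    )

scene = [0.0, 0.2, 0.9, 0.4, 0.7, 0.1, 0.5, 0.3]   # shared, static structure
true_poses = [0, 1, 3]                              # per-frame camera trajectory
video = [render(scene, p, 4) for p in true_poses]   # observed frames

# Pose estimation driven only by the reconstruction loss (no pose labels used).
recovered = [estimate_pose(scene, f) for f in video]
total_loss = sum(
    reconstruction_loss(f, render(scene, p, 4)) for f, p in zip(video, recovered)
)

# "Novel view synthesis": render the same structure from a pose not in the video.
novel_view = render(scene, 4, 4)
```

Because the structure is shared across frames, only the pose changes from one frame to the next, which is the property the abstract relies on when it assumes a static scene in nearby frames.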


Related content

Autoencoder


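An autoencoder is an artificial neural network used to learn efficient data encodings in an unsupervised manner. Its aim is to learn a representation (encoding) of a dataset, typically for dimensionality reduction, by training the network to ignore signal "noise". Alongside the reduction side, a reconstruction side is learned, in which the autoencoder tries to generate, from the reduced encoding, a representation as close as possible to its original input, hence the name. Several variants of the basic model exist, each aiming to force the learned representation of the input to take on useful properties. Autoencoders are applied effectively to many problems, from facial recognition to capturing the semantic meaning of words.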
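The reduction and reconstruction sides described above can be sketched with a minimal linear autoencoder. This is an illustrative assumption rather than any particular library's implementation: it uses a single tied weight vector `w` for both encoder and decoder, a hand-picked dataset of 2-D points lying on one line, and plain gradient descent with an analytically derived gradient.

```python
# Tied-weight linear autoencoder, 2-D input -> 1-D code (illustrative assumption).
data = [(0.6 * t, 0.8 * t) for t in (-2.0, -1.0, 1.0, 2.0)]  # points on one line

def encode(w, x):
    """Reduction side: compress a 2-D point to a single number."""
    return w[0] * x[0] + w[1] * x[1]

def decode(w, z):
    """Reconstruction side: rebuild a 2-D point from its 1-D code."""
    return (z * w[0], z * w[1])

def mean_loss(w):
    """Mean squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        rx = decode(w, encode(w, x))
        total += (x[0] - rx[0]) ** 2 + (x[1] - rx[1]) ** 2
    return total / len(data)

def gradient(w):
    """Analytic gradient of mean_loss with respect to w."""
    g = [0.0, 0.0]
    for x in data:
        z = encode(w, x)
        r = (x[0] - z * w[0], x[1] - z * w[1])   # reconstruction residual
        rw = r[0] * w[0] + r[1] * w[1]
        for j in range(2):
            g[j] += -2.0 * (rw * x[j] + z * r[j]) / len(data)
    return g

w = [1.0, 0.0]                 # arbitrary starting weights
initial_loss = mean_loss(w)
for _ in range(200):           # plain gradient descent, step size 0.05
    g = gradient(w)
    w = [w[0] - 0.05 * g[0], w[1] - 0.05 * g[1]]
final_loss = mean_loss(w)
```

Since the data lie on a single line, one number per point is enough to reconstruct them, and the learned `w` aligns with that line's direction. Practical autoencoders use nonlinear, multi-layer encoders and decoders, but the objective has the same shape.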
