We present Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos. Reconstructing humans that move naturally from monocular in-the-wild videos is difficult: it requires accurately separating humans from arbitrary backgrounds, and it requires reconstructing detailed 3D surfaces from short video sequences, which makes the problem even more challenging. Despite these challenges, our method does not require any ground-truth supervision or priors extracted from large datasets of clothed human scans, nor do we rely on any external segmentation modules. Instead, it solves the tasks of scene decomposition and surface reconstruction directly in 3D by modeling both the human and the background in the scene jointly, parameterized via two separate neural fields. Specifically, we define a temporally consistent human representation in canonical space and formulate a global optimization over the background model, the canonical human shape and texture, and per-frame human pose parameters. A coarse-to-fine sampling strategy for volume rendering and novel objectives are introduced for a clean separation of the dynamic human from the static background, yielding detailed and robust 3D human geometry reconstructions. We evaluate our method on publicly available datasets and show improvements over prior art.
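To make the joint human/background modeling concrete, the following is a minimal, hypothetical PyTorch sketch of how a dynamic human field (queried in canonical space) and a static background field might be composited along camera rays. The names TinyField, warp_to_canonical, and composite_render, as well as the network sizes and density-to-color compositing rule, are illustrative assumptions, not the paper's actual architecture or objectives.

```python
# Hedged sketch: compositing two neural fields (dynamic human + static background)
# along camera rays via volume rendering. Simplified placeholders throughout.
import torch
import torch.nn as nn

class TinyField(nn.Module):
    """A toy neural field: maps a 3D point to (density, RGB)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 1 density + 3 color channels
        )

    def forward(self, x):
        out = self.net(x)
        density = torch.relu(out[..., :1])   # non-negative density
        color = torch.sigmoid(out[..., 1:])  # RGB in [0, 1]
        return density, color

def composite_render(human_field, bg_field, points, deltas, warp_to_canonical):
    """Alpha-composite samples from both fields along each ray.

    points: (R, S, 3) sample locations in world space
    deltas: (R, S, 1) distances between consecutive samples
    warp_to_canonical: callable mapping observed points to canonical space
                       (placeholder for a pose-conditioned deformation)
    """
    # Human field is queried in canonical space; background in world space.
    sigma_h, rgb_h = human_field(warp_to_canonical(points))
    sigma_b, rgb_b = bg_field(points)

    # Densities add; colors are density-weighted before compositing.
    sigma = sigma_h + sigma_b
    rgb = (sigma_h * rgb_h + sigma_b * rgb_b) / (sigma + 1e-8)

    alpha = 1.0 - torch.exp(-sigma * deltas)                      # (R, S, 1)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-8], dim=1),
        dim=1)[:, :-1]                                            # transmittance
    weights = alpha * trans
    pixel_rgb = (weights * rgb).sum(dim=1)                        # (R, 3) rendered color
    human_mask = (weights * sigma_h / (sigma + 1e-8)).sum(dim=1)  # soft human opacity
    return pixel_rgb, human_mask
```

In a composited formulation like this, supervising pixel_rgb against the input frames while encouraging human_mask toward a clean, binary separation is one plausible way a dynamic human could be decomposed from a static background without external segmentation; the actual sampling strategy and objectives used by Vid2Avatar are those described in the paper itself.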