Recently, data-driven single-view reconstruction methods have shown great progress in modeling 3D dressed humans. However, such methods suffer heavily from the depth ambiguities and occlusions inherent to single-view inputs. In this paper, we address these issues by lifting the single-view input with additional views and investigate the best strategy to suitably exploit information from multiple views. We propose an end-to-end approach that learns an implicit 3D representation of dressed humans from sparse camera views. Specifically, we introduce two key components: first, an attention-based fusion layer that learns to aggregate visual information from several viewpoints; second, a mechanism that encodes local 3D patterns under the multi-view context. In the experiments, we show that the proposed approach outperforms the state of the art on standard data both quantitatively and qualitatively. Additionally, we apply our method to real data acquired with a multi-camera platform and demonstrate that our approach obtains results comparable to multi-view stereo with dramatically fewer views.
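To make the attention-based fusion concrete, the following is a minimal sketch of how per-view, pixel-aligned features for a query 3D point could be aggregated with self-attention, so that the fused feature is invariant to the number and ordering of cameras. This is an illustrative assumption, not the paper's actual implementation; the names `ViewFusion`, `feat_dim`, and `num_heads` are hypothetical.

```python
# Hedged sketch of an attention-based multi-view fusion layer.
# Assumes each query 3D point has already been projected into every camera
# view and a feature vector sampled from that view's image feature map.
import torch
import torch.nn as nn


class ViewFusion(nn.Module):
    """Aggregate per-view features for a query point with self-attention."""

    def __init__(self, feat_dim=256, num_heads=4):
        super().__init__()
        # Self-attention lets each view's feature attend to the others,
        # weighting views by how informative they are for this point.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.out = nn.Linear(feat_dim, feat_dim)

    def forward(self, view_feats):
        # view_feats: (batch, n_views, feat_dim), one vector per camera view.
        fused, _ = self.attn(view_feats, view_feats, view_feats)
        # Mean-pool over views so the result does not depend on view order
        # or on how many cameras are available at test time.
        return self.out(fused.mean(dim=1))


# Usage example: 8 query points, 4 sparse views, 256-dim features.
feats = torch.randn(8, 4, 256)
fusion = ViewFusion()
print(fusion(feats).shape)  # torch.Size([8, 256])
```

The fused feature would then be fed to an implicit-function decoder (e.g. an MLP predicting occupancy) to reconstruct the dressed human surface.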