To date, little attention has been given to multi-view 3D human mesh estimation, despite its real-life applicability (e.g., motion capture, sport analysis) and its robustness to single-view ambiguities. Existing solutions typically suffer from poor generalization to new settings, largely due to the limited diversity of image-mesh pairs in multi-view training data. To address this shortcoming, prior work has explored the use of synthetic images. However, beyond the usual visual gap between rendered and target data, synthetic-data-driven multi-view estimators also suffer from overfitting to the camera viewpoint distribution sampled during training, which usually differs from real-world distributions. Tackling both challenges, we propose a novel simulation-based training pipeline for multi-view human mesh recovery, which (a) relies on intermediate 2D representations that are more robust to the synthetic-to-real domain gap; (b) leverages learnable calibration and triangulation to adapt to more diversified camera setups; and (c) progressively aggregates multi-view information in a canonical 3D space to remove ambiguities in 2D representations. Through extensive benchmarking, we demonstrate the superiority of the proposed solution, especially in unseen in-the-wild scenarios.
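To make the geometric backbone of step (b) concrete, below is a minimal sketch of classic direct linear transform (DLT) triangulation, the fixed-camera operation that a learnable calibration-and-triangulation module generalizes: given 2D joint detections from several calibrated views, it recovers the corresponding 3D joint in a common world frame. This is illustrative only and not the authors' implementation; the function name, camera matrices, and toy setup are our own assumptions.

```python
# Minimal sketch (not the paper's code) of DLT triangulation:
# fuse 2D joint detections from N calibrated views into one 3D point.
import numpy as np

def triangulate_dlt(points_2d, proj_mats):
    """Triangulate one 3D point from N >= 2 views via DLT.

    points_2d : (N, 2) array of pixel coordinates, one per view.
    proj_mats : (N, 3, 4) projection matrices P = K [R | t].
    Returns the 3D point in the shared (canonical) world frame.
    """
    # Each view contributes two linear constraints on the homogeneous
    # point X: (u * P[2] - P[0]) X = 0 and (v * P[2] - P[1]) X = 0.
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)  # (2N, 4) linear system A X = 0
    # Least-squares solution: right singular vector of A with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Toy usage with two synthetic cameras observing the point (0, 0, 2).
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])           # reference view
R2 = np.array([[0., 0., -1.], [0., 1., 0.], [1., 0., 0.]])  # 90-degree yaw
t2 = np.array([[2.], [0.], [2.]])
P2 = K @ np.hstack([R2, t2])
X_true = np.array([0., 0., 2., 1.])
x1 = P1 @ X_true; x1 = x1[:2] / x1[2]
x2 = P2 @ X_true; x2 = x2[:2] / x2[2]
print(triangulate_dlt(np.stack([x1, x2]), np.stack([P1, P2])))  # ~ [0, 0, 2]
```

In the pipeline described above, the fixed matrices P would instead come from a learnable calibration stage, letting the estimator adapt to camera setups unseen during synthetic training rather than overfitting to the sampled viewpoint distribution.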