The increasing availability of video recordings made by multiple cameras has offered new means for mitigating occlusion and depth ambiguities in pose and motion reconstruction methods. Yet, multi-view algorithms strongly depend on camera parameters; particularly, the relative transformations between the cameras. Such a dependency becomes a hurdle once shifting to dynamic capture in uncontrolled settings. We introduce FLEX (Free muLti-view rEconstruXion), an end-to-end extrinsic parameter-free multi-view model. FLEX is extrinsic parameter-free (dubbed ep-free) in the sense that it does not require extrinsic camera parameters. Our key idea is that the 3D angles between skeletal parts, as well as bone lengths, are invariant to the camera position. Hence, learning 3D rotations and bone lengths rather than locations allows predicting common values for all camera views. Our network takes multiple video streams, learns fused deep features through a novel multi-view fusion layer, and reconstructs a single consistent skeleton with temporally coherent joint rotations. We demonstrate quantitative and qualitative results on three public datasets, and on synthetic multi-person video streams captured by dynamic cameras. We compare our model to state-of-the-art methods that are not ep-free and show that in the absence of camera parameters, we outperform them by a large margin while obtaining comparable results when camera parameters are available. Code, trained models, and other materials are available on our project page.
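The invariance argument at the heart of the abstract can be checked numerically. The following is a minimal sketch (not the paper's code) showing that bone lengths and 3D angles between skeletal parts are unchanged by a rigid world-to-camera transform, while joint positions themselves are not; the joint coordinates and camera extrinsics are made-up values for illustration only.

```python
# Illustrative sketch of the invariance motivating FLEX: bone lengths and
# 3D joint angles are the same in every camera frame, so a network can
# predict a single value for all views. All numbers below are hypothetical.
import numpy as np

def bone_length(a, b):
    """Euclidean length of the bone connecting joints a and b."""
    return np.linalg.norm(b - a)

def bone_angle(a, b, c):
    """Angle (radians) at joint b between bones b->a and b->c."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Hypothetical hip, knee, ankle positions in world coordinates (meters).
hip   = np.array([0.0, 0.90, 0.00])
knee  = np.array([0.0, 0.45, 0.05])
ankle = np.array([0.0, 0.00, 0.00])

# An arbitrary camera extrinsic: rotation about the y-axis plus a translation.
theta = np.deg2rad(30.0)
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([1.0, 0.2, 3.0])
to_cam = lambda p: R @ p + t  # world -> camera coordinates

# Joint positions differ between the two frames, but lengths and angles agree.
print(bone_length(hip, knee), bone_length(to_cam(hip), to_cam(knee)))
print(bone_angle(hip, knee, ankle),
      bone_angle(to_cam(hip), to_cam(knee), to_cam(ankle)))
```

Running the sketch prints identical length and angle values for the world frame and the transformed camera frame, which is exactly why predicting rotations and bone lengths sidesteps the need for extrinsic camera parameters.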