The increasing availability of video recordings made by multiple cameras has offered new means for mitigating occlusion and depth ambiguities in pose and motion reconstruction methods. Yet, multi-view algorithms strongly depend on camera parameters, in particular on the relative positions among the cameras. Such dependency becomes a hurdle once the capture shifts to dynamic, uncontrolled settings. We introduce FLEX (Free muLti-view rEconstruXion), an end-to-end parameter-free multi-view model. FLEX is parameter-free in the sense that it requires no camera parameters, neither intrinsic nor extrinsic. Our key idea is that the 3D angles between skeletal parts, as well as bone lengths, are invariant to the camera position. Hence, learning 3D rotations and bone lengths rather than locations allows the model to predict values that are common to all camera views. Our network takes multiple video streams, learns fused deep features through a novel multi-view fusion layer, and reconstructs a single consistent skeleton with temporally coherent joint rotations. We demonstrate quantitative and qualitative results on the Human3.6M and KTH Multi-view Football II datasets. We compare our model to state-of-the-art methods that are not parameter-free and show that in the absence of camera parameters, we outperform them by a large margin while obtaining comparable results when camera parameters are available. Code, trained models, video demonstration, and additional materials will be available on our project page.
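To make the key invariance concrete, below is a minimal numpy sketch (not from the paper; the toy skeleton, parent indices, and helper functions are hypothetical illustrations) verifying that bone lengths and the 3D angles between consecutive bones are unchanged under a rigid camera transform, while joint locations are not:

```python
# Minimal sketch of the invariance FLEX exploits: a rigid transform
# (rotation + translation, standing in for a change of camera extrinsics)
# preserves bone lengths and inter-bone angles but moves joint locations.
import numpy as np

def bone_vectors(joints, parents):
    """Vector from each joint's parent to the joint (root is skipped)."""
    return np.array([joints[j] - joints[p]
                     for j, p in enumerate(parents) if p >= 0])

def bone_lengths(joints, parents):
    return np.linalg.norm(bone_vectors(joints, parents), axis=1)

def joint_angles(joints, parents):
    """Angle (radians) between consecutive bones along the chain."""
    v = bone_vectors(joints, parents)
    cos = np.sum(v[:-1] * v[1:], axis=1) / (
        np.linalg.norm(v[:-1], axis=1) * np.linalg.norm(v[1:], axis=1))
    return np.arccos(np.clip(cos, -1.0, 1.0))

rng = np.random.default_rng(0)
joints = rng.normal(size=(4, 3))   # toy 4-joint kinematic chain
parents = [-1, 0, 1, 2]            # each joint's parent; -1 marks the root

# Random proper rotation + translation, i.e. a different "camera view".
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Q *= np.sign(np.linalg.det(Q))     # force det(Q) = +1 (a proper rotation)
t = rng.normal(size=3)
joints_view2 = joints @ Q.T + t

assert np.allclose(bone_lengths(joints, parents),
                   bone_lengths(joints_view2, parents))
assert np.allclose(joint_angles(joints, parents),
                   joint_angles(joints_view2, parents))
assert not np.allclose(joints, joints_view2)  # locations do change per view
```

This is why predicting rotations and bone lengths, rather than joint locations, lets a single set of outputs remain consistent across all camera views.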