3D human pose estimation faces challenging problems, such as poor performance caused by occlusion and self-occlusion. Recently, IMU-vision sensor fusion has been regarded as a valuable way to address these problems. However, previous research on the fusion of heterogeneous IMU and vision data has failed to adequately utilize either IMU raw data or reliable high-level vision features. To facilitate more efficient sensor fusion, in this work we propose a framework called \emph{FusePose} built upon a parametric human kinematic model. Specifically, we aggregate different information from IMU and vision data and introduce three distinctive sensor fusion approaches: NaiveFuse, KineFuse, and AdaDeepFuse. NaiveFuse serves as a basic approach that fuses only simplified IMU data and the estimated 3D pose in Euclidean space. In kinematic space, KineFuse integrates calibrated and aligned IMU raw data with converted 3D pose parameters. AdaDeepFuse further extends this kinematic fusion process into an adaptive, end-to-end trainable manner. Comprehensive experiments with ablation studies demonstrate the rationality and superiority of the proposed framework, improving 3D human pose estimation performance over the baseline. On the Total Capture dataset, KineFuse surpasses the previous state of the art that uses IMUs only for testing by 8.6\%, and AdaDeepFuse surpasses the state of the art that uses IMUs for both training and testing by 8.5\%. Moreover, we validate the generalization capability of our framework through experiments on the Human3.6M dataset.