Accurate 3D reconstruction of hand and object shape from a hand-object image is important for understanding human-object interaction as well as human daily activities. Unlike bare-hand pose estimation, hand-object interaction imposes strong constraints on both the hand and the manipulated object, which suggests that the hand configuration can serve as crucial contextual information for the object, and vice versa. However, current approaches address this task by training a two-branch network that reconstructs the hand and the object separately, with little communication between the two branches. In this work, we propose to consider the hand and the object jointly in feature space and to exploit the reciprocity of the two branches. We extensively investigate cross-branch feature-fusion architectures built with MLP or LSTM units. Among the investigated architectures, a variant with LSTM units that enhances the object features with hand features yields the best performance gain. Moreover, we employ an auxiliary depth-estimation module that augments the input RGB image with an estimated depth map, which further improves reconstruction accuracy. Experiments on public datasets demonstrate that our approach significantly outperforms existing methods in object reconstruction accuracy.
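The hand-to-object fusion described above can be illustrated with a minimal sketch: one LSTM step in which the hand feature plays the role of the input and the object feature plays the role of the hidden state, so the gates control how much hand context flows into the enhanced object feature. All shapes, parameter names, and the single-step formulation here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical feature dimension

# Hypothetical fusion parameters: W acts on the hand feature (the LSTM input),
# U acts on the object feature (used as the hidden state).
W = rng.standard_normal((4 * d, d)) * 0.1
U = rng.standard_normal((4 * d, d)) * 0.1
b = np.zeros(4 * d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_fuse(obj_feat, cell, hand_feat):
    """One LSTM step enhancing the object feature with the hand feature:
    gates (computed from both features) decide how much hand context
    is written into the cell and exposed in the enhanced object feature."""
    z = W @ hand_feat + U @ obj_feat + b
    i, f, o, g = np.split(z, 4)  # input, forget, output gates + candidate
    new_cell = sigmoid(f) * cell + sigmoid(i) * np.tanh(g)
    enhanced_obj = sigmoid(o) * np.tanh(new_cell)
    return enhanced_obj, new_cell

obj_feat = rng.standard_normal(d)   # stand-in for the object-branch feature
hand_feat = rng.standard_normal(d)  # stand-in for the hand-branch feature
enhanced, cell = lstm_fuse(obj_feat, np.zeros(d), hand_feat)
```

The enhanced object feature would then replace the original one in the object branch before the final shape decoder; in practice the features are high-dimensional CNN activations rather than small vectors.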