3D hand pose estimation has made significant progress in recent years. However, this improvement is highly dependent on the emergence of large-scale annotated datasets. To alleviate the label-hungry limitation, we propose a multi-view collaborative self-supervised learning framework, HaMuCo, which estimates hand pose with only pseudo labels for training. We use a two-stage strategy to tackle the noisy-label challenge and the multi-view ``groupthink'' problem. In the first stage, we estimate the 3D hand pose for each view independently. In the second stage, we employ a cross-view interaction network to capture cross-view correlated features and use a multi-view consistency loss to achieve collaborative learning among views. To further enhance the collaboration between the single-view and multi-view branches, we fuse the results of all views to supervise the single-view network. In summary, we introduce collaborative learning at two levels: the cross-view level and the multi- to single-view level. Extensive experiments show that our method achieves state-of-the-art performance on multi-view self-supervised hand pose estimation. Moreover, ablation studies verify the effectiveness of each component, and results on multiple datasets further demonstrate the generalization ability of our network.
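To make the two collaborative objectives concrete, below is a minimal PyTorch-style sketch, not HaMuCo's actual implementation: the function names (`to_world`, `collaborative_losses`), the mean-based fusion of views, and the camera-to-world transform are assumptions for illustration; the paper's cross-view interaction network and exact loss forms may differ.

```python
import torch

def to_world(joints_cam, R, t):
    # Map camera-space joints (V, J, 3) to a shared world frame using
    # per-view rotation R (V, 3, 3) and translation t (V, 1, 3).
    return torch.einsum('vij,vkj->vki', R, joints_cam) + t

def collaborative_losses(single_view_joints, cross_view_joints, R, t):
    """Hypothetical sketch of the two collaborative objectives.

    single_view_joints: (V, J, 3) stage-one, per-view 3D predictions.
    cross_view_joints:  (V, J, 3) stage-two predictions after the
                        cross-view interaction network.
    """
    sv_world = to_world(single_view_joints, R, t)
    cv_world = to_world(cross_view_joints, R, t)

    # Multi-view consistency: predictions of the same hand should agree
    # once all views are expressed in the common frame.
    cv_mean = cv_world.mean(dim=0, keepdim=True)
    consistency_loss = (cv_world - cv_mean).norm(dim=-1).mean()

    # Multi- to single-view supervision: the fused multi-view result
    # (detached, so it acts as a pseudo label) guides each single view.
    distill_loss = (sv_world - cv_mean.detach()).norm(dim=-1).mean()

    return consistency_loss, distill_loss
```

With V views and J joints, the two inputs are (V, J, 3) tensors and R, t are the known camera extrinsics; detaching the fused target is what turns the multi-view output into a pseudo label for the single-view branch.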