RGB-based 3D hand pose estimation has been successful for decades thanks to large-scale databases and deep learning. However, hand pose estimation networks perform poorly on hand images whose characteristics differ substantially from the training data. This degradation is caused by factors such as illumination, camera angles, and diverse backgrounds in the input images. Many existing methods attempt to solve this problem by supplying additional large-scale unconstrained/target-domain images to expand the data space; however, collecting such large-scale images requires substantial labor. In this paper, we present a simple image-free domain generalization approach for hand pose estimation that uses only source-domain data. We manipulate the image features of the hand pose estimation network by adding features derived from text descriptions using the CLIP (Contrastive Language-Image Pre-training) model. The manipulated image features are then exploited to train the hand pose estimation network via a contrastive learning framework. In experiments on the STB and RHD datasets, our algorithm shows improved performance over state-of-the-art domain generalization approaches.
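As a rough illustration of the idea (a minimal sketch, not the authors' implementation), the snippet below encodes a few assumed text prompts with CLIP, shifts the pose network's image features toward those text embeddings, and scores original/manipulated pairs with an InfoNCE-style contrastive loss. The prompt wording, the mixing weight `alpha`, and the assumption that the pose network's image features are already projected to CLIP's embedding dimension are all hypothetical choices for this sketch.

```python
# Hypothetical sketch: CLIP-text-guided feature manipulation + contrastive loss.
# Not the paper's released code; names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

# Assumed text prompts describing unseen target-domain conditions.
prompts = [
    "a photo of a hand in dim lighting",
    "a photo of a hand against a cluttered background",
]
with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize(prompts).to(device))
    text_feat = F.normalize(text_feat.float(), dim=-1)  # (P, D), D = 512 for ViT-B/32

def manipulate(image_feat, alpha=0.5):
    """Shift normalized image features toward a randomly chosen text embedding.

    image_feat: (B, D) features from the pose network, assumed projected to CLIP's dim.
    """
    image_feat = F.normalize(image_feat, dim=-1)
    idx = torch.randint(len(text_feat), (image_feat.size(0),), device=image_feat.device)
    mixed = image_feat + alpha * text_feat[idx]
    return F.normalize(mixed, dim=-1)

def contrastive_loss(orig_feat, manip_feat, tau=0.07):
    """InfoNCE: each original feature should match its own manipulated version."""
    logits = F.normalize(orig_feat, dim=-1) @ manip_feat.t() / tau  # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

Under these assumptions, the manipulated features stand in for images from unseen domains, so the pose network is trained to produce features that remain consistent under such text-driven shifts without requiring any target-domain images.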