Recently, a few self-supervised representation learning (SSL) methods have outperformed ImageNet classification pre-training on vision tasks such as object detection. However, their effect on 3D human body pose and shape estimation (3DHPSE) remains an open question: 3DHPSE targets a single class, the human, and has an inherent task gap with SSL. We empirically study and analyze the effects of SSL and further compare it with other pre-training alternatives for 3DHPSE. The alternatives are 2D annotation-based pre-training and synthetic data pre-training, which share SSL's motivation of reducing labeling cost. They have been widely utilized as a source of weak supervision or fine-tuning, but have not been regarded as a pre-training source. SSL methods underperform conventional ImageNet classification pre-training on multiple 3DHPSE benchmarks by 7.7% on average. In contrast, despite using far less pre-training data, 2D annotation-based pre-training improves accuracy on all benchmarks and converges faster during fine-tuning. Our observations challenge the naive application of current SSL pre-training to 3DHPSE and highlight the value of other data types for pre-training.