Movement and pose assessment of newborns lets experienced pediatricians predict neurodevelopmental disorders early, allowing timely intervention for related diseases. However, most recent AI approaches to human pose estimation focus on adults, and there is no public benchmark for infant pose estimation. In this paper, we fill this gap by proposing an infant pose dataset and a Deep Aggregation Vision Transformer (AggPose) for human pose estimation, a fully transformer-based framework that trains quickly and uses no convolution operations to extract features in its early stages. It generalizes the Transformer + MLP design to high-resolution deep layer aggregation across feature maps, enabling information fusion between different vision levels. We pre-train AggPose on the COCO pose dataset and apply it to our newly released large-scale infant pose estimation dataset. The results show that AggPose effectively learns multi-scale features across resolutions and significantly improves the performance of infant pose estimation. AggPose outperforms the hybrid models HRFormer and TokenPose on the infant pose estimation dataset, and it also outperforms HRFormer by 0.8 AP on average on COCO val pose estimation. Our code is available at github.com/SZAR-LAB/AggPose.
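To make the cross-resolution fusion idea concrete, below is a minimal sketch, in PyTorch, of how two transformer token streams kept at different resolutions might exchange information through learned up/down projections. This is not the authors' implementation; the module name `CrossResolutionFusion`, the dimensions, and the resampling choices (bilinear upsampling, average-pool downsampling) are illustrative assumptions about one way such multi-level fusion can be realized.

```python
import torch
import torch.nn as nn


class CrossResolutionFusion(nn.Module):
    """Hypothetical fusion of two token streams at different spatial resolutions."""

    def __init__(self, dim_high: int, dim_low: int):
        super().__init__()
        # Project each stream into the other's channel width before mixing.
        self.low_to_high = nn.Linear(dim_low, dim_high)
        self.high_to_low = nn.Linear(dim_high, dim_low)

    def forward(self, tokens_high, tokens_low, hw_high, hw_low):
        b = tokens_high.size(0)
        h_hi, w_hi = hw_high
        h_lo, w_lo = hw_low
        # Reshape token sequences back to 2-D maps so we can resample spatially.
        map_low = tokens_low.transpose(1, 2).reshape(b, -1, h_lo, w_lo)
        map_high = tokens_high.transpose(1, 2).reshape(b, -1, h_hi, w_hi)

        # Low -> high: upsample spatially, then project channels and add.
        up = nn.functional.interpolate(
            map_low, size=(h_hi, w_hi), mode="bilinear", align_corners=False
        )
        up = up.flatten(2).transpose(1, 2)  # (B, N_high, dim_low)
        fused_high = tokens_high + self.low_to_high(up)

        # High -> low: downsample spatially, then project channels and add.
        down = nn.functional.adaptive_avg_pool2d(map_high, (h_lo, w_lo))
        down = down.flatten(2).transpose(1, 2)  # (B, N_low, dim_high)
        fused_low = tokens_low + self.high_to_low(down)
        return fused_high, fused_low


if __name__ == "__main__":
    fuse = CrossResolutionFusion(dim_high=64, dim_low=128)
    hi = torch.randn(2, 32 * 32, 64)   # high-resolution tokens on a 32x32 grid
    lo = torch.randn(2, 16 * 16, 128)  # low-resolution tokens on a 16x16 grid
    out_hi, out_lo = fuse(hi, lo, (32, 32), (16, 16))
    print(out_hi.shape, out_lo.shape)  # (2, 1024, 64) and (2, 256, 128)
```

Under these assumptions, stacking such fusion steps between transformer stages lets high-resolution tokens absorb semantic context from coarser levels while low-resolution tokens regain spatial detail, which is the kind of multi-level information exchange the abstract describes.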