Yoga is a popular form of exercise worldwide owing to its spiritual and physical health benefits, but incorrect postures can lead to injury. Automated yoga pose classification has therefore gained importance as a way to reduce reliance on expert practitioners. Although human pose keypoint extraction models have shown strong potential in action recognition, systematic benchmarking for yoga pose recognition remains limited: prior works often focus solely on raw images or on a single pose extraction model. In this study, we introduce a curated dataset, 'Yoga-16', which addresses limitations of existing datasets, and systematically evaluate three deep learning architectures (VGG16, ResNet50, and Xception) across three input modalities (raw images, MediaPipe Pose skeleton images, and YOLOv8 Pose skeleton images). Our experiments demonstrate that skeleton-based representations outperform raw image inputs, with the highest accuracy of 96.09% achieved by VGG16 on MediaPipe Pose skeleton input. Additionally, we validate our results with cross-validation and provide an interpretability analysis using Grad-CAM, offering insights into model decision-making for yoga pose classification.
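For readers unfamiliar with the skeleton-image modality, the following minimal Python sketch illustrates one way to render a MediaPipe Pose skeleton onto a blank canvas before feeding it to a CNN classifier. The file names, blank-canvas choice, and static-image settings are illustrative assumptions, not necessarily the paper's exact preprocessing pipeline.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils

# Hypothetical input image path, for illustration only.
image = cv2.imread("pose.jpg")

# Run MediaPipe Pose in static-image mode to extract keypoints.
with mp_pose.Pose(static_image_mode=True) as pose:
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

# Draw the detected skeleton on a blank canvas so the downstream
# classifier sees only keypoints and limb connections, not raw pixels.
canvas = np.zeros_like(image)
if results.pose_landmarks:
    mp_drawing.draw_landmarks(
        canvas, results.pose_landmarks, mp_pose.POSE_CONNECTIONS)

cv2.imwrite("pose_skeleton.jpg", canvas)
```

The resulting skeleton image can then be resized and passed to a standard image classifier such as VGG16; rendering onto a blank canvas (rather than overlaying on the photo) is one plausible way to isolate pose geometry from appearance cues.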