Animal pose estimation is a fundamental task in computer vision, with growing importance in ecological monitoring, behavioral analysis, and intelligent livestock management. Compared to human pose estimation, animal pose estimation is more challenging due to high interspecies morphological diversity, complex body structures, and limited annotated data. In this work, we introduce DiffPose-Animal, a novel diffusion-based framework for top-down animal pose estimation. Unlike traditional heatmap regression methods, DiffPose-Animal reformulates pose estimation as a denoising process under the generative framework of diffusion models. To enhance semantic guidance during keypoint generation, we leverage large language models (LLMs) to extract both global anatomical priors and local keypoint-wise semantics based on species-specific prompts. These textual priors are encoded and fused with image features via cross-attention modules to provide biologically meaningful constraints throughout the denoising process. Additionally, a diffusion-based keypoint decoder is designed to progressively refine pose predictions, improving robustness to occlusion and annotation sparsity. Extensive experiments on public animal pose datasets demonstrate the effectiveness and generalization capability of our method, especially under challenging scenarios with diverse species, cluttered backgrounds, and incomplete keypoints.
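To make the described pipeline concrete, below is a minimal sketch (not the authors' code) of a single reverse-diffusion step over keypoint coordinates, in which keypoint tokens attend to concatenated image features and LLM-derived text priors through cross-attention. All module names, tensor shapes, the number of keypoints, and the noise-schedule values are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of diffusion-based keypoint denoising with cross-attention
# conditioning on image and text-prior tokens. Shapes and hyperparameters
# (K keypoints, feature dim, timestep count) are assumed for illustration.
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_steps=1000):
        super().__init__()
        self.coord_embed = nn.Linear(2, dim)        # noisy (x, y) -> token
        self.t_embed = nn.Embedding(num_steps, dim) # diffusion timestep embedding
        # keypoint tokens (queries) attend to image + text prior tokens
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 2))

    def forward(self, noisy_kpts, t, img_tokens, text_tokens):
        # noisy_kpts: (B, K, 2) normalized coordinates at step t
        # img_tokens: (B, N_img, dim); text_tokens: (B, N_txt, dim)
        q = self.coord_embed(noisy_kpts) + self.t_embed(t)[:, None, :]
        cond = torch.cat([img_tokens, text_tokens], dim=1)
        fused, _ = self.cross_attn(q, cond, cond)   # semantic guidance via cross-attention
        return self.head(fused)                     # predicted noise on (x, y)

# Toy usage: one DDPM-style update starting from Gaussian noise.
B, K, dim = 2, 17, 256
model = ConditionalDenoiser(dim=dim)
x_t = torch.randn(B, K, 2)
img_tokens = torch.randn(B, 64, dim)
text_tokens = torch.randn(B, 8, dim)
t = torch.full((B,), 999, dtype=torch.long)
eps_hat = model(x_t, t, img_tokens, text_tokens)
alpha, alpha_bar = 0.999, 0.5                       # placeholder schedule values
x_prev = (x_t - (1 - alpha) / (1 - alpha_bar) ** 0.5 * eps_hat) / alpha ** 0.5
print(x_prev.shape)                                 # (2, 17, 2)
```

In this reading, the progressive refinement described in the abstract corresponds to iterating such denoising steps from pure noise down to step 0, with the text priors injected at every step so the anatomical constraints shape the entire trajectory rather than only the final prediction.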