Pre-trained diffusion models provide rich multi-scale latent features and are emerging as powerful vision backbones. While recent works such as Marigold and Lotus adapt diffusion priors for dense prediction with strong cross-domain generalization, their potential for structured outputs remains underexplored. In this paper, we propose SDPose, a fine-tuning framework built upon Stable Diffusion to fully exploit pre-trained diffusion priors for human pose estimation. First, rather than modifying cross-attention modules or introducing learnable embeddings, we directly predict keypoint heatmaps in the SD U-Net's image latent space to preserve the original generative priors. Second, we map these latent features into keypoint heatmaps through a lightweight convolutional pose head, which avoids disrupting the pre-trained backbone. Finally, to prevent overfitting and enhance out-of-distribution robustness, we incorporate an auxiliary RGB reconstruction branch that preserves domain-transferable generative semantics. To evaluate robustness under domain shift, we further construct COCO-OOD, a style-transferred variant of COCO with preserved annotations. With just one-fifth of the training schedule used by Sapiens on COCO, SDPose attains parity with Sapiens-1B/2B on the COCO validation set and establishes a new state of the art on the cross-domain benchmarks HumanArt and COCO-OOD. Extensive ablations highlight the importance of diffusion priors, RGB reconstruction, and multi-scale SD U-Net features for cross-domain generalization, and t-SNE analyses further explain SD's domain-invariant latent structure. We also show that SDPose serves as an effective zero-shot pose annotator for controllable image and video generation.
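The abstract's keypoint-heatmap formulation follows standard practice in pose estimation: each joint is supervised with a 2D Gaussian rendered at its ground-truth location, and the pose head regresses one such map per keypoint. As an illustrative sketch only (the abstract does not specify SDPose's exact target construction, resolution, or `sigma`), the targets can be built as:

```python
import numpy as np

def keypoint_heatmap(h, w, cx, cy, sigma=2.0):
    """Render a 2D Gaussian target centered at keypoint (cx, cy).

    This is the conventional heatmap target for keypoint estimation;
    the resolution and sigma here are illustrative assumptions, not
    values from the paper.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# One heatmap per keypoint, e.g. the 17 COCO joints; two shown here
# at an assumed 64x64 output resolution.
keypoints = [(20, 30), (40, 10)]  # (x, y) pairs
targets = np.stack([keypoint_heatmap(64, 64, x, y) for x, y in keypoints])
```

A convolutional pose head (as described in the abstract) would then be trained to map the U-Net's latent features to this `(num_keypoints, H, W)` tensor, with each map peaking at its joint's location.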
