We propose MonoSE(3)-Diffusion, a monocular SE(3) diffusion framework that formulates markerless, image-based robot pose estimation as a conditional denoising diffusion process. The framework consists of two processes: a visibility-constrained diffusion process for diverse pose augmentation and a timestep-aware reverse process for progressive pose refinement. The diffusion process progressively perturbs ground-truth poses into noisy transformations for training a pose-denoising network. Importantly, we integrate visibility constraints into this process, ensuring that the perturbed transformations remain within the camera's field of view. Compared to the fixed-scale perturbations used in current methods, the diffusion process generates diverse, in-view training poses, thereby improving the network's generalization capability. Furthermore, the reverse process iteratively predicts poses with the denoising network and refines the estimates by sampling from the diffusion posterior at the current timestep, following a scheduled coarse-to-fine procedure. Moreover, the timestep indicates the transformation scale, guiding the denoising network toward more accurate pose predictions. The reverse process is more robust than direct prediction, benefiting from its timestep-aware refinement scheme. Our approach achieves consistent improvements on two benchmarks (DREAM and RoboKeyGen), reaching a notable AUC of 66.75 on the most challenging dataset, a 32.3% gain over the state of the art.
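To make the visibility-constrained pose augmentation concrete, the following is a minimal sketch (not the authors' code) of how in-view noisy poses could be generated for training: a ground-truth SE(3) pose is perturbed with rotation and translation noise, and samples whose projected position falls outside the image are rejected. The function names (`perturb_pose_in_view`, `in_view`), the Gaussian noise model, and the rejection-sampling scheme are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of visibility-constrained pose perturbation (assumed, illustrative only).
import numpy as np

def exp_so3(w):
    """Rodrigues' formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def in_view(t_cam, K_intr, img_w, img_h):
    """Check that a 3D point (e.g. the robot base in camera frame) projects inside the image."""
    if t_cam[2] <= 0:                      # behind the camera
        return False
    uv = K_intr @ t_cam
    u, v = uv[0] / uv[2], uv[1] / uv[2]
    return 0.0 <= u < img_w and 0.0 <= v < img_h

def perturb_pose_in_view(R_gt, t_gt, sigma_rot, sigma_trans,
                         K_intr, img_w, img_h, max_tries=100):
    """Perturb a ground-truth SE(3) pose, rejecting out-of-view samples."""
    for _ in range(max_tries):
        dR = exp_so3(np.random.randn(3) * sigma_rot)   # rotation noise
        dt = np.random.randn(3) * sigma_trans          # translation noise
        R_noisy, t_noisy = dR @ R_gt, t_gt + dt
        if in_view(t_noisy, K_intr, img_w, img_h):
            return R_noisy, t_noisy
    return R_gt, t_gt                                  # fall back to the clean pose

# Example usage with a toy pinhole camera (assumed intrinsics).
K_intr = np.array([[600.0, 0.0, 320.0],
                   [0.0, 600.0, 240.0],
                   [0.0, 0.0, 1.0]])
R_gt, t_gt = np.eye(3), np.array([0.0, 0.0, 1.5])
R_noisy, t_noisy = perturb_pose_in_view(R_gt, t_gt, 0.3, 0.1, K_intr, 640, 480)
```

In this reading, the noise scales `sigma_rot` and `sigma_trans` would follow the diffusion schedule (larger at later timesteps), so the same routine yields both coarse and fine training perturbations while keeping every sample inside the camera's field of view.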