6D object pose estimation is a crucial prerequisite for autonomous robot manipulation applications. The state-of-the-art models for pose estimation are convolutional neural network (CNN)-based. Lately, Transformers, an architecture originally proposed for natural language processing, have been achieving state-of-the-art results in many computer vision tasks as well. Equipped with the multi-head self-attention mechanism, Transformers enable simple single-stage end-to-end architectures that learn object detection and 6D object pose estimation jointly. In this work, we propose YOLOPose (short for You Only Look Once Pose estimation), a Transformer-based multi-object 6D pose estimation method based on keypoint regression. In contrast to the standard heatmap-based approach for predicting keypoints in an image, we regress the keypoints directly. Additionally, we employ a learnable orientation estimation module to predict the orientation from the keypoints. Together with a separate translation estimation module, our model is end-to-end differentiable. Our method is suitable for real-time applications and achieves results comparable to state-of-the-art methods.
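To make the described pipeline concrete, the following is a minimal PyTorch-style sketch of a pose head that regresses keypoints directly from Transformer object queries and feeds them to a learnable orientation module, with a separate translation head. This is an illustrative sketch, not the authors' implementation: the module name `PoseHead`, the layer sizes, and the choice of a 6D rotation representation as the orientation output are assumptions for demonstration.

```python
# Illustrative sketch (assumed names and sizes, not the authors' code):
# each Transformer object query yields regressed 2D keypoints, which a
# learnable MLP maps to an orientation; translation is predicted separately,
# so the whole head is end-to-end differentiable.
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    def __init__(self, d_model=256, num_keypoints=8):
        super().__init__()
        # Direct keypoint regression instead of heatmaps:
        # each query predicts num_keypoints (x, y) pairs.
        self.keypoint_mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, num_keypoints * 2),
        )
        # Learnable orientation estimation from the regressed keypoints;
        # here it outputs a 6D rotation representation (an assumption).
        self.orientation_mlp = nn.Sequential(
            nn.Linear(num_keypoints * 2, 128), nn.ReLU(),
            nn.Linear(128, 6),
        )
        # Separate translation head predicting (x, y, z).
        self.translation_mlp = nn.Linear(d_model, 3)

    def forward(self, queries):
        # queries: (batch, num_queries, d_model) decoder outputs
        kpts = self.keypoint_mlp(queries)       # (B, Q, 2 * num_keypoints)
        rot6d = self.orientation_mlp(kpts)      # (B, Q, 6)
        trans = self.translation_mlp(queries)   # (B, Q, 3)
        return kpts, rot6d, trans
```

Because the orientation is computed from the regressed keypoints by a learnable module rather than by a non-differentiable PnP solver, gradients flow from the orientation loss back through the keypoints, which is what makes the joint architecture trainable end to end.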