In this paper, we study the representation of the shape and pose of objects using their keypoints. To this end, we propose an end-to-end method that simultaneously detects 2D keypoints from an image and lifts them to 3D. The proposed method learns both 2D detection and 3D lifting from 2D keypoint annotations alone. In this regard, we propose, for the first time, a novel method that explicitly disentangles pose and 3D shape by means of augmentation-based cyclic self-supervision. In addition to being end-to-end in image-to-3D learning, our method handles objects from multiple categories with a single neural network. We use a Transformer-based architecture to detect the keypoints as well as to summarize the visual context of the image. This visual context is then used while lifting the keypoints to 3D, allowing context-based reasoning for better performance. For the lifting, our method learns a small set of basis shapes and their sparse non-negative coefficients to represent the 3D shape in a canonical frame. Our method can handle occlusions as well as a wide variety of object classes. Experiments on three benchmarks demonstrate that our method outperforms the state-of-the-art. Our source code will be made publicly available.
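As an illustration of such a shape-basis representation (the notation below is ours and the exact formulation in the paper may differ), the canonical 3D shape can be written as a sparse, non-negative combination of learned basis shapes,

    S = \sum_{k=1}^{K} c_k B_k,  with  c_k \ge 0  and  c  sparse,  where  S, B_k \in \mathbb{R}^{3 \times N},

where B_1, ..., B_K are the learned basis shapes, c is the per-instance coefficient vector, and the N columns of S are the canonical 3D keypoints. Pose is then represented separately by a rigid transform, e.g. X = R S + t \mathbf{1}^\top with R \in SO(3) and t \in \mathbb{R}^3, which is the factor that augmentation-based cyclic self-supervision can disentangle from the canonical shape S.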