Accurate 6D object pose estimation is an important task for a variety of robotic applications such as grasping or localization. The task is challenging due to object symmetries, clutter, and occlusion, and becomes even harder when additional information, such as depth and 3D models, is not provided. We present a transformer-based approach that takes an RGB image as input and predicts a 6D pose for each object in the image. Besides the image, our network does not require any additional information such as depth maps or 3D object models. First, the image is passed through an object detector to generate feature maps and to detect objects. Then, the feature maps are fed into a transformer, with the detected bounding boxes provided as additional information. Afterwards, the output object queries are processed by separate translation and rotation heads. We achieve state-of-the-art results for RGB-only approaches on the challenging YCB-V dataset. We illustrate the suitability of the resulting model as a pose sensor for a 6-DoF state estimation task. Code is available at https://github.com/aau-cns/poet.
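To make the described pipeline concrete, the sketch below mirrors its stages in PyTorch: image features from a detector backbone, detected bounding boxes injected as additional transformer inputs, and separate translation and rotation heads applied to the output object queries. This is a minimal illustration under stated assumptions, not the released PoET code; the backbone stub, embedding sizes, query count, and the 6D rotation parameterization are hypothetical choices for readability.

```python
# Minimal sketch of the described RGB-only pose pipeline.
# NOT the authors' implementation: all names, sizes, and the
# rotation parameterization are assumptions for illustration.
import torch
import torch.nn as nn

class RGBPosePipeline(nn.Module):
    def __init__(self, d_model=256, n_queries=20):
        super().__init__()
        # Stand-in for the object detector backbone that yields feature maps.
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Detected boxes (x, y, w, h) are embedded and passed to the
        # transformer as additional information alongside image features.
        self.box_embed = nn.Linear(4, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        # Separate heads: 3D translation and a 6D rotation representation.
        self.translation_head = nn.Linear(d_model, 3)
        self.rotation_head = nn.Linear(d_model, 6)

    def forward(self, image, boxes):
        # image: (B, 3, H, W); boxes: (B, N_det, 4) from the detector.
        feats = self.backbone(image).flatten(2).transpose(1, 2)
        memory = torch.cat([feats, self.box_embed(boxes)], dim=1)
        tgt = self.queries.unsqueeze(0).expand(image.size(0), -1, -1)
        out = self.transformer(memory, tgt)  # object queries attend to memory
        return self.translation_head(out), self.rotation_head(out)

# Usage: one 480x640 RGB frame with two detected boxes.
model = RGBPosePipeline()
img = torch.randn(1, 3, 480, 640)
boxes = torch.rand(1, 2, 4)
t, r = model(img, boxes)  # t: (1, 20, 3), r: (1, 20, 6)
```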