6D pose estimation is the task of predicting the translation and orientation of objects in a given input image, which is a crucial prerequisite for many robotics and augmented reality applications. Recently, the Transformer architecture, built around a multi-head self-attention mechanism, has achieved state-of-the-art results in many computer vision tasks. DETR, a Transformer-based model, formulated object detection as a set prediction problem and achieved impressive results without standard components such as region-of-interest pooling, non-maximum suppression, and bounding box proposals. In this work, we propose T6D-Direct, a real-time, single-stage direct method with a Transformer-based architecture built on DETR that performs multi-object 6D pose estimation. We evaluate our method on the YCB-Video dataset. It achieves the fastest inference time, and its pose estimation accuracy is comparable to state-of-the-art methods.
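To make the notion of a 6D pose concrete, the following minimal sketch represents a pose as a 3x3 rotation matrix plus a 3D translation vector and applies it to a single object-frame point. The names `Pose6D`, `rotation_z`, and `transform_point` are hypothetical and chosen only for illustration; they are not part of T6D-Direct or DETR, and the rotation-matrix parameterization is an assumption rather than the representation used by the method.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Pose6D:
    """Hypothetical container for a 6D pose: 3 rotational + 3 translational DoF."""
    rotation: List[List[float]]  # 3x3 rotation matrix (object frame -> camera frame)
    translation: List[float]     # 3-vector, in the same unit as the object points


def rotation_z(angle_rad: float) -> List[List[float]]:
    """Rotation matrix for a counter-clockwise turn about the z-axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def transform_point(pose: Pose6D, point: Sequence[float]) -> List[float]:
    """Map an object-frame point into the camera frame: p_cam = R @ p_obj + t."""
    return [
        sum(pose.rotation[i][j] * point[j] for j in range(3)) + pose.translation[i]
        for i in range(3)
    ]


# Illustrative values only: rotate 90 degrees about z, then shift 0.5 along z.
pose = Pose6D(rotation=rotation_z(math.pi / 2), translation=[0.0, 0.0, 0.5])
camera_point = transform_point(pose, [1.0, 0.0, 0.0])
```

Here the object-frame point (1, 0, 0) is rotated onto the y-axis and then shifted along z, so `camera_point` is approximately (0, 1, 0.5). A multi-object estimator would predict one such rotation and translation per detected object.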