Transparent objects are widely used in industrial automation and daily life. However, robust visual recognition and perception of transparent objects has long been a major challenge. Most commercial-grade depth cameras still struggle to sense the surfaces of transparent objects due to the refraction and reflection of light. In this work, we present a transformer-based transparent object depth estimation approach that operates on a single RGB-D input. We observe that the global receptive field of the transformer makes it easier to extract the contextual information needed to estimate depth in transparent regions. In addition, to better enhance fine-grained features, we design a feature fusion module (FFM) to assist coherent prediction. Our empirical evidence demonstrates that our model delivers significant improvements on recent popular datasets, e.g., a 25% gain on RMSE and a 21% gain on REL over previous state-of-the-art convolution-based counterparts on the ClearGrasp dataset. Extensive results show that our transformer-based model better aggregates the object's RGB and inaccurate depth information to obtain an improved depth representation. Our code and pre-trained model will be available at https://github.com/yuchendoudou/TODE.