Reconstructing a 3D object from a 2D image is a well-studied vision problem to which many deep learning techniques have been applied. 3D convolutional approaches are the most common, though previous work has shown that methods based on 2D convolutions can reach state-of-the-art results while being significantly more efficient to train. Motivated by the recent rise of transformers for vision tasks, where they often outperform convolutional methods, and by earlier attempts to use transformers for 3D object reconstruction, we set out to use visual transformers in place of convolutions in existing efficient, high-performing techniques for 3D object reconstruction, with the aim of achieving superior results on the task. Using a transformer-based encoder and decoder to predict 3D structure from 2D images, we achieve accuracy similar or superior to the baseline approach. This study serves as evidence for the potential of visual transformers in 3D object reconstruction.