Deep CNN-based methods have so far achieved state-of-the-art results in multi-view 3D object reconstruction. Despite this considerable progress, the two core modules of these methods, multi-view feature extraction and fusion, are usually investigated separately, and the relations between objects observed in different views are rarely explored. In this paper, inspired by the recent success of self-attention-based Transformer models, we reformulate multi-view 3D reconstruction as a sequence-to-sequence prediction problem and propose a new framework named 3D Volume Transformer (VolT) for this task. Unlike previous CNN-based methods that use a separate design, we unify feature extraction and view fusion in a single Transformer network. A natural advantage of our design lies in exploring view-to-view relationships via self-attention among multiple unordered inputs. On ShapeNet, a large-scale 3D reconstruction benchmark dataset, our method achieves new state-of-the-art accuracy in multi-view reconstruction with $70\%$ fewer parameters than other CNN-based methods. Experimental results also suggest the strong scaling capability of our method. Our code will be made publicly available.
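To make the core idea concrete, below is a minimal sketch, not the authors' implementation, of fusing an unordered set of per-view embeddings with standard self-attention so that every view attends to every other view. All module names, dimensions, and the voxel-prediction head here are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch: permutation-agnostic view fusion with self-attention.
# Assumes each input view has already been embedded into a feature vector
# by some 2D backbone (the embedding step is omitted here).
import torch
import torch.nn as nn


class ViewFusionSketch(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8, num_layers=4, vox_res=32):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Illustrative head mapping the fused representation to occupancy logits.
        self.vox_res = vox_res
        self.to_voxels = nn.Linear(embed_dim, vox_res ** 3)

    def forward(self, view_tokens):
        # view_tokens: (batch, num_views, embed_dim), one token per input view.
        # No positional encoding is added, so the fusion does not depend on
        # view order, in line with the "multiple unordered inputs" setting.
        fused = self.encoder(view_tokens)      # view-to-view self-attention
        pooled = fused.mean(dim=1)             # permutation-invariant pooling
        logits = self.to_voxels(pooled)        # coarse voxel occupancy logits
        return logits.view(-1, self.vox_res, self.vox_res, self.vox_res)


# Usage: a batch of 2 objects, each observed from 5 views.
tokens = torch.randn(2, 5, 256)
volume_logits = ViewFusionSketch()(tokens)     # shape (2, 32, 32, 32)
```

Because attention weights are computed between every pair of view tokens, feature aggregation and cross-view reasoning happen in the same module, which is the unified design the abstract contrasts with separate CNN extraction-plus-fusion pipelines.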