Most modern deep learning-based multi-view 3D reconstruction techniques use RNNs or fusion modules to combine information from multiple images after encoding them independently. These two separate steps are loosely connected and do not allow easy information sharing among views. We propose LegoFormer, a transformer model for voxel-based 3D reconstruction that uses attention layers to share information among views during all computational stages. Moreover, instead of predicting each voxel independently, we propose to parametrize the output with a series of low-rank decomposition factors. This reformulation allows an object to be predicted as a set of independent regular structures that are then aggregated to obtain the final reconstruction. Experiments conducted on ShapeNet demonstrate that our model performs competitively with the state of the art while offering increased interpretability thanks to the self-attention layers. We also show promising generalization results on real data.
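To make the low-rank parametrization concrete, below is a minimal sketch (not the paper's actual implementation) of how a set of rank-one decomposition factors, one vector per spatial axis, can be aggregated via outer products into a voxel occupancy grid. The factor count K, the grid resolution, and the sigmoid squashing are illustrative assumptions.

```python
import numpy as np

def aggregate_factors(xs, ys, zs):
    """Aggregate K rank-one factors into a voxel grid.

    xs, ys, zs: arrays of shape (K, R) holding the per-axis factor vectors.
    Each factor's outer product forms a regular "block"; summing the blocks
    yields the full reconstruction: grid[i, j, l] = sum_k xs[k, i] * ys[k, j] * zs[k, l].
    """
    grid = np.einsum('ki,kj,kl->ijl', xs, ys, zs)
    # Squash to [0, 1] so the result can be read as occupancy probabilities
    # (assumed post-processing, for illustration only).
    return 1.0 / (1.0 + np.exp(-grid))

K, R = 12, 32  # assumed number of factors and voxel grid resolution
rng = np.random.default_rng(0)
occupancy = aggregate_factors(rng.normal(size=(K, R)),
                              rng.normal(size=(K, R)),
                              rng.normal(size=(K, R)))
print(occupancy.shape)  # (32, 32, 32)
```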