In recent years, many video tasks have achieved breakthroughs by utilizing the vision transformer and establishing spatial-temporal decoupling for feature extraction. Although multi-view 3D reconstruction also takes multiple images as input, it cannot directly inherit this success because the associations between unordered views are completely ambiguous: there is no usable prior relationship analogous to the temporal coherence of a video. To solve this problem, we propose a novel transformer network for Unordered Multiple Images (UMIFormer). It exploits transformer blocks for decoupled intra-view encoding and designed blocks for token rectification that mine the correlation between similar tokens from different views to achieve decoupled inter-view encoding. Afterward, all tokens acquired from the various branches are compressed into a fixed-size compact representation that preserves rich information for reconstruction by leveraging the similarities between tokens. Empirical results on ShapeNet confirm that our decoupled learning method is adaptable to unordered multiple images, and the experiments also verify that our model outperforms existing SOTA methods by a large margin.
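To illustrate the final step described above, the following is a minimal, hypothetical PyTorch sketch of similarity-based token compression: tokens gathered from all view branches are soft-assigned to a fixed number of learned slots by cosine similarity, yielding a fixed-size representation regardless of the number of views. The class name, the use of learned prototypes, and all parameter choices are illustrative assumptions, not the paper's actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimilarityTokenCompressor(nn.Module):
    """Illustrative sketch (not the paper's exact design): compress a variable
    number of view tokens into a fixed-size set by soft-assigning each token
    to learned prototype slots according to cosine similarity."""

    def __init__(self, dim: int, num_compressed: int = 64):
        super().__init__()
        # Learned prototypes defining the fixed-size output slots (assumption).
        self.prototypes = nn.Parameter(torch.randn(num_compressed, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [B, V * N, C] -- all tokens gathered from V view branches.
        t = F.normalize(tokens, dim=-1)
        p = F.normalize(self.prototypes, dim=-1)        # [K, C]
        sim = t @ p.t()                                 # [B, V*N, K] cosine similarities
        weights = sim.softmax(dim=1)                    # distribute tokens over slots
        compressed = weights.transpose(1, 2) @ tokens   # [B, K, C] fixed-size output
        return compressed


if __name__ == "__main__":
    views, n_tokens, dim = 5, 197, 384          # e.g., 5 unordered input views
    x = torch.randn(2, views * n_tokens, dim)   # batch of concatenated view tokens
    out = SimilarityTokenCompressor(dim, num_compressed=64)(x)
    print(out.shape)                            # torch.Size([2, 64, 384])
```

The key property of such a scheme is that the output size depends only on the number of slots, so the decoder receives a compact representation of the same shape whether the scene is observed from 2 views or 20.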