Vision Transformers (ViTs) have proven effective in solving 2D image understanding tasks by training on large-scale image datasets; meanwhile, as a largely separate track, they have also been used to model the 3D visual world, e.g., voxels or point clouds. However, with the growing hope that transformers can become the "universal" modeling tool for heterogeneous data, ViTs for 2D and 3D tasks have so far adopted vastly different architecture designs that are hardly transferable. That invites an (over-)ambitious question: can we close the gap between the 2D and 3D ViT architectures? As a pilot study, this paper demonstrates the appealing promise of understanding the 3D visual world using a standard 2D ViT architecture, with only minimal customization at the input and output levels and without redesigning the pipeline. To build a 3D ViT from its 2D sibling, we "inflate" the patch embedding and token sequence, accompanied by new positional encoding mechanisms designed to match the 3D data geometry. The resultant "minimalist" 3D ViT, named Simple3D-Former, performs surprisingly robustly on popular 3D tasks such as object classification, point cloud segmentation, and indoor scene detection, compared to highly customized 3D-specific designs. It can hence act as a strong baseline for new 3D ViTs. Moreover, we note that pursuing a unified 2D-3D ViT design has practical relevance beyond scientific curiosity. Specifically, we demonstrate that Simple3D-Former naturally enables exploiting the wealth of pre-trained weights from large-scale realistic 2D images (e.g., ImageNet), which can be plugged in to enhance 3D task performance "for free".
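To make the "inflation" idea concrete, below is a minimal PyTorch sketch, our own illustration rather than the paper's released code, of how a pretrained 2D patch-embedding kernel might be repeated along a depth axis to tokenize voxel grids; the module name `PatchEmbed3D`, the depth-repeat-and-rescale heuristic, and the single-channel voxel input are all assumptions for exposition.

```python
# Illustrative sketch (assumed design, not the authors' implementation):
# "inflate" a 2D ViT patch embedding into a 3D voxel patch embedding.
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Embed a voxel grid into a token sequence, mirroring ViT's 2D patchify."""
    def __init__(self, grid=32, patch=4, in_ch=1, dim=768):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
        n_tokens = (grid // patch) ** 3
        # Learnable positional embedding, one vector per voxel patch (one of
        # several possible 3D positional encoding choices).
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))

    @torch.no_grad()
    def inflate_from_2d(self, conv2d: nn.Conv2d):
        """Copy a pretrained 2D kernel by repeating it along depth and
        rescaling by 1/p, so the inflated filter's response matches the
        2D one on depth-constant input."""
        w2d = conv2d.weight                                # (dim, c, p, p)
        p = self.proj.kernel_size[0]
        w3d = w2d.unsqueeze(2).repeat(1, 1, p, 1, 1) / p   # (dim, c, p, p, p)
        # Collapse RGB input channels to the single voxel occupancy channel.
        self.proj.weight.copy_(w3d.mean(1, keepdim=True))
        if conv2d.bias is not None:
            self.proj.bias.copy_(conv2d.bias)

    def forward(self, vox):                    # vox: (B, C, D, H, W)
        x = self.proj(vox)                     # (B, dim, D', H', W')
        x = x.flatten(2).transpose(1, 2)       # (B, N, dim) token sequence
        return x + self.pos
```

With an ImageNet-pretrained backbone whose patchify layer is a `Conv2d` (as in common ViT implementations), one could call `embed.inflate_from_2d(pretrained_conv2d)` and feed the resulting tokens to the unchanged 2D transformer blocks, which is how the pre-trained 2D weights would be reused "for free".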