Pure Transformer models have achieved impressive success in natural language processing and computer vision. However, a key limitation of Transformers is their reliance on large amounts of training data. In the 3D point cloud domain, large datasets are scarce, which exacerbates the difficulty of training Transformers for 3D tasks. In this work, we empirically investigate the effect of transferring knowledge learned from a large number of images to point cloud understanding. We formulate a pipeline dubbed \textit{Pix4Point} that harnesses Transformers pretrained in the image domain to improve downstream point cloud tasks. This is achieved by a modality-agnostic pure Transformer backbone combined with a tokenizer and decoder layers specialized for the 3D domain. Using image-pretrained Transformers, we observe significant performance gains of Pix4Point on 3D point cloud classification, part segmentation, and semantic segmentation on the ScanObjectNN, ShapeNetPart, and S3DIS benchmarks, respectively. Our code and models are available at: \url{https://github.com/guochengqian/Pix4Point}.
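To make the pipeline concrete, the following is a minimal PyTorch sketch of the idea, not the authors' implementation (see the repository for that). It assumes the \texttt{timm} library for image-pretrained ViT weights; the \texttt{NaivePointTokenizer}, its random-center grouping strategy, the mean-pooled classification head, and the 15-class output (matching ScanObjectNN) are all illustrative simplifications.

\begin{verbatim}
import torch
import torch.nn as nn
import timm  # assumption: timm supplies the image-pretrained ViT


class NaivePointTokenizer(nn.Module):
    """Illustrative 3D tokenizer: group points into local patches and
    embed each patch with a shared MLP. The actual Pix4Point tokenizer
    is more sophisticated; this only conveys the interface."""

    def __init__(self, num_groups=128, group_size=32, embed_dim=384):
        super().__init__()
        self.num_groups, self.group_size = num_groups, group_size
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.GELU(), nn.Linear(128, embed_dim))

    def forward(self, xyz):                       # xyz: (B, N, 3)
        B, N, _ = xyz.shape
        # Naive grouping: random centers + nearest neighbors.
        idx = torch.randperm(N, device=xyz.device)[: self.num_groups]
        centers = xyz[:, idx]                     # (B, G, 3)
        dists = torch.cdist(centers, xyz)         # (B, G, N)
        knn = dists.topk(self.group_size, largest=False).indices
        groups = torch.gather(
            xyz.unsqueeze(1).expand(-1, self.num_groups, -1, -1), 2,
            knn.unsqueeze(-1).expand(-1, -1, -1, 3))  # (B, G, K, 3)
        groups = groups - centers.unsqueeze(2)    # relative coordinates
        # Per-group max pooling yields one token per local patch.
        return self.mlp(groups).max(dim=2).values  # (B, G, C)


class Pix4PointSketch(nn.Module):
    def __init__(self, num_classes=15):           # 15: ScanObjectNN
        super().__init__()
        vit = timm.create_model('vit_small_patch16_224', pretrained=True)
        self.tokenizer = NaivePointTokenizer(embed_dim=vit.embed_dim)
        # Reuse the image-pretrained Transformer blocks unchanged:
        # the backbone itself is modality-agnostic.
        self.blocks, self.norm = vit.blocks, vit.norm
        self.head = nn.Linear(vit.embed_dim, num_classes)

    def forward(self, xyz):
        x = self.tokenizer(xyz)   # point tokens replace image patches
        x = self.norm(self.blocks(x))
        return self.head(x.mean(dim=1))           # global token pooling


logits = Pix4PointSketch()(torch.rand(2, 1024, 3))  # shape: (2, 15)
\end{verbatim}

The design point the sketch highlights is that only the tokenizer and the task head are 3D-specific; the Transformer blocks in the middle are loaded verbatim from an image-pretrained checkpoint, which is what allows image-domain knowledge to transfer to point clouds.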