The success of deep learning heavily relies on large-scale data with comprehensive labels, which is more expensive and time-consuming to acquire in 3D than for 2D images or natural language. This motivates the use of models pretrained on data other than 3D as teachers for cross-modal knowledge transfer. In this paper, we revisit masked modeling through a unified knowledge-distillation lens, and we show that foundational Transformers pretrained on 2D images or natural language can help self-supervised 3D representation learning by training Autoencoders as Cross-Modal Teachers (ACT). The pretrained Transformers are transferred into cross-modal 3D teachers via discrete variational autoencoding self-supervision, during which the Transformers are kept frozen and adapted with prompt tuning for better knowledge inheritance. The latent features encoded by the 3D teachers serve as the target of masked point modeling, wherein the dark knowledge is distilled into the 3D Transformer students as foundational geometric understanding. Our ACT-pretrained 3D learner achieves state-of-the-art generalization across various downstream benchmarks, e.g., 88.21% overall accuracy on ScanObjectNN. Code will be released at https://github.com/RunpeiDong/ACT.
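To make the described objective concrete, below is a minimal sketch of the masked-point-modeling distillation step: a frozen, prompt-tuned cross-modal teacher encodes point tokens into latent targets, and the 3D Transformer student is trained to predict those latents at masked positions. All module names (MaskedPointDistillation, student, teacher, head), shapes, and the choice of smooth-L1 regression are illustrative assumptions, not the authors' implementation (see the repository above for that).

```python
# Sketch of ACT-style masked point modeling with a frozen cross-modal teacher.
# Hypothetical module names and shapes; not the official implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedPointDistillation(nn.Module):
    """Distill frozen cross-modal teacher latents into a 3D student via masked point modeling."""

    def __init__(self, student: nn.Module, teacher: nn.Module, dim: int = 384):
        super().__init__()
        self.student = student              # 3D Transformer student (trainable)
        self.teacher = teacher              # frozen, prompt-tuned cross-modal 3D teacher
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.head = nn.Linear(dim, dim)     # predicts teacher latents for masked tokens

    def forward(self, point_tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # point_tokens: (B, N, dim) grouped point-patch embeddings; mask: (B, N) bool, True = masked
        with torch.no_grad():
            target = self.teacher(point_tokens)          # latent targets from the frozen teacher
        visible = point_tokens * (~mask).unsqueeze(-1)   # student sees only unmasked tokens
        pred = self.head(self.student(visible))          # (B, N, dim) student predictions
        # regress teacher latents at masked positions (dark-knowledge distillation target)
        return F.smooth_l1_loss(pred[mask], target[mask])
```

In this sketch, both `student` and `teacher` are assumed to map (B, N, dim) token sequences to (B, N, dim) features; the loss is computed only over masked tokens, mirroring the abstract's description of teacher latents as masked-point-modeling targets.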