Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training - Zhuanzhi Papers


Masked Autoencoders (MAE) · 3D · Point Cloud · Masked Autoencoder · Modality

March 30, 2023

Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training


Ziyu Guo, Renrui Zhang, Longtian Qiu, Xianzhi Li, Pheng-Ann Heng

From arXiv, 10 pages, 5 figures

Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for both 2D and 3D computer vision. However, existing MAE-style methods can only learn from data of a single modality, i.e., either images or point clouds, which neglects the implicit semantic and geometric correlation between 2D and 3D. In this paper, we explore how the 2D modality can benefit 3D masked autoencoding, and propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training. Joint-MAE randomly masks an input 3D point cloud and its projected 2D images, and then reconstructs the masked information of the two modalities. For better cross-modal interaction, we build Joint-MAE from two hierarchical 2D-3D embedding modules, a joint encoder, and a joint decoder with modal-shared and modal-specific decoders. On top of this, we further introduce two cross-modal strategies to boost 3D representation learning: local-aligned attention mechanisms for 2D-3D semantic cues, and a cross-reconstruction loss for 2D-3D geometric constraints. With this pre-training paradigm, Joint-MAE achieves superior performance on multiple downstream tasks, e.g., 92.4% accuracy for linear SVM on ModelNet40 and 86.07% accuracy on the hardest split of ScanObjectNN.
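The input side of the pipeline described in the abstract (project the point cloud to a 2D image, then randomly mask both modalities) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The orthographic projection, the 32x32 depth map, the 8x8 patch size, the 0.75 mask ratio, masking individual points rather than point groups, and all function names are assumptions made only for illustration. The last function is one plausible reading of a 2D-3D geometric constraint (project the reconstructed points and compare against the 2D target); it is not differentiable as written, so a training setup would need a differentiable projection.

```python
import numpy as np


def project_to_depth_map(points, size=32):
    """Orthographically project (N, 3) points in [0, 1]^3 onto the xy-plane.

    Assumption: x and y select the pixel, z is stored as the depth value.
    """
    pts = np.clip(points, 0.0, 1.0)
    cols = np.minimum((pts[:, 0] * size).astype(int), size - 1)
    rows = np.minimum((pts[:, 1] * size).astype(int), size - 1)
    depth = np.zeros((size, size), dtype=np.float64)
    # Several points can fall into one pixel; keep the largest z per pixel.
    np.maximum.at(depth, (rows, cols), pts[:, 2])
    return depth


def random_mask(num_tokens, mask_ratio, rng):
    """Return a boolean array in which True marks a masked token."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.permutation(num_tokens)[:num_masked]] = True
    return mask


def cross_reconstruction_loss(recon_points, target_depth):
    """Hypothetical 2D-3D constraint: MSE between the projection of the
    reconstructed points and the target depth map."""
    size = target_depth.shape[0]
    return float(np.mean((project_to_depth_map(recon_points, size) - target_depth) ** 2))


rng = np.random.default_rng(0)
points = rng.random((1024, 3))                # toy point cloud in the unit cube
depth = project_to_depth_map(points, size=32)

patch = 8                                     # assumed patch size
patches = (
    depth.reshape(32 // patch, patch, 32 // patch, patch)
    .transpose(0, 2, 1, 3)
    .reshape(-1, patch * patch)
)                                             # 16 patches of 64 pixels each

point_mask = random_mask(len(points), 0.75, rng)   # masked 3D tokens
patch_mask = random_mask(len(patches), 0.75, rng)  # masked 2D tokens

visible_points = points[~point_mask]          # what an encoder would see in 3D
visible_patches = patches[~patch_mask]        # what an encoder would see in 2D
```

In this toy setup, a perfect reconstruction (`recon_points` equal to `points`) gives a cross-reconstruction loss of zero, since both sides project to the same depth map.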

