CLIP Goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition - Zhuanzhi Papers


3D recognition · zero-shot · extraction · sample · recognition

April 4, 2023

CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition


Deepti Hegde, Jeya Maria Jose Valanarasu, Vishal M. Patel

From arXiv. Website: https://jeya-maria-jose.github.io/cg3d-web/

Vision-Language models like CLIP have been widely adopted for various tasks due to their impressive zero-shot capabilities. However, CLIP is not suitable for extracting 3D geometric features as it was trained on only images and text by natural language supervision. We work on addressing this limitation and propose a new framework termed CG3D (CLIP Goes 3D) where a 3D encoder is learned to exhibit zero-shot capabilities. CG3D is trained using triplets of pointclouds, corresponding rendered 2D images, and texts using natural language supervision. To align the features in a multimodal embedding space, we utilize contrastive loss on 3D features obtained from the 3D encoder, as well as visual and text features extracted from CLIP. We note that the natural images used to train CLIP and the rendered 2D images in CG3D have a distribution shift. Attempting to train the visual and text encoder to account for this shift results in catastrophic forgetting and a notable decrease in performance. To solve this, we employ prompt tuning and introduce trainable parameters in the input space to shift CLIP towards the 3D pre-training dataset utilized in CG3D. We extensively test our pre-trained CG3D framework and demonstrate its impressive capabilities in zero-shot, open scene understanding, and retrieval tasks. Further, it also serves as strong starting weights for fine-tuning in downstream 3D recognition tasks.
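The alignment step described in the abstract can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE-style contrastive loss, equal weighting of the 3D-image and 3D-text terms, and a temperature of 0.07; none of these specifics are stated in the abstract, and all function names are hypothetical. Plain Python lists stand in for batches of features from the 3D encoder and from CLIP's visual and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy in which anchors[i] should match targets[i]."""
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i against every target in the batch.
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_contrastive(x, y, temperature=0.07):
    """Average of the x-to-y and y-to-x matching losses."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def alignment_loss(feat_3d, feat_img, feat_txt, temperature=0.07):
    """Pull 3D features toward image and text features (assumed equal weights)."""
    return (symmetric_contrastive(feat_3d, feat_img, temperature)
            + symmetric_contrastive(feat_3d, feat_txt, temperature))
```

In this sketch the loss is close to zero when each point-cloud feature is nearest to the image and text features of its own triplet, and it grows when the pairing is shuffled.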

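Prompt tuning in the input space can be sketched on a toy model. The example below is an illustration only, not the CG3D implementation: a fixed linear map over mean-pooled tokens stands in for a frozen CLIP encoder, a squared-error objective replaces the contrastive loss, and all names are hypothetical. It shows the key property that the encoder weights stay untouched while only the prepended prompt tokens receive gradient updates.

```python
def mean_pool(seq):
    """Average a sequence of equal-length token vectors."""
    n = len(seq)
    return [sum(tok[k] for tok in seq) / n for k in range(len(seq[0]))]


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def encode(W_frozen, prompts, tokens):
    """Toy frozen encoder: prompts are prepended to the input tokens."""
    return matvec(W_frozen, mean_pool(prompts + tokens))


def prompt_loss(W_frozen, prompts, tokens, target):
    """Squared distance between the encoded input and a target embedding."""
    out = encode(W_frozen, prompts, tokens)
    return sum((o - t) ** 2 for o, t in zip(out, target))


def prompt_step(W_frozen, prompts, tokens, target, lr=0.5):
    """One gradient-descent step that updates the prompts only."""
    out = encode(W_frozen, prompts, tokens)
    residual = [2.0 * (o - t) for o, t in zip(out, target)]
    n = len(prompts) + len(tokens)
    d = len(prompts[0])
    # Every prompt token enters through the mean, so each one has the
    # same gradient: W^T @ residual / n.
    grad = [sum(W_frozen[r][k] * residual[r] for r in range(len(W_frozen))) / n
            for k in range(d)]
    # W_frozen and tokens are never modified; new prompt vectors are returned.
    return [[p[k] - lr * grad[k] for k in range(d)] for p in prompts]
```

Repeating `prompt_step` moves the encoded output toward the target without changing `W_frozen`, which is the reason this style of adaptation avoids overwriting what the pre-trained encoder already knows.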
