GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration - 专知论文

会员服务 ·

0

GPT-4V · 多峰值 · 机器人 · Vision · 语言模型化 ·

GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration

翻译：暂无翻译

Naoki Wake,Atsushi Kanehira,Kazuhiro Sasabuchi,Jun Takamatsu,Katsushi Ikeuchi

from arxiv, 8 pages, 10 figures, 3 tables. Published in IEEE Robotics and Automation Letters (RA-L) (in press). Last updated on September 26th, 2024

We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), to facilitate one-shot visual teaching for robotic manipulation. This system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances. The process begins with GPT-4V analyzing the videos to obtain textual explanations of environmental and action details. A GPT-4-based task planner then encodes these details into a symbolic task plan. Subsequently, vision systems spatially and temporally ground the task plan in the videos. Objects are identified using an open-vocabulary object detector, and hand-object interactions are analyzed to pinpoint moments of grasping and releasing. This spatiotemporal grounding allows for the gathering of affordance information (e.g., grasp types, waypoints, and body postures) critical for robot execution. Experiments across various scenarios demonstrate the method's efficacy in enabling real robots to operate from one-shot human demonstrations. Meanwhile, quantitative tests have revealed instances of hallucination in GPT-4V, highlighting the importance of incorporating human supervision within the pipeline. The prompts of GPT-4V/GPT-4 are available at this project page: https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/

翻译：暂无翻译

0

相关内容

GPT-4V

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

44+阅读 · 2021年11月24日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

30+阅读 · 2021年9月29日

Linux导论，Introduction to Linux，96页ppt

Linux导论，Introduction to Linux，96页ppt

专知会员服务

77+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

28+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

47+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

34+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

58+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

56+阅读 · 2019年10月17日

《DeepGCNs: Making GCNs Go as Deep as CNNs》

《DeepGCNs: Making GCNs Go as Deep as CNNs》

专知会员服务

30+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

151+阅读 · 2019年10月12日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

26+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

27+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

17+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

42+阅读 · 2019年1月3日

meta learning 17年：MAML SNAIL

meta learning 17年：MAML SNAIL

CreateAMind

11+阅读 · 2019年1月2日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

16+阅读 · 2018年12月24日

STRCF for Visual Object Tracking

STRCF for Visual Object Tracking

统计学习与视觉计算组

14+阅读 · 2018年5月29日

Focal Loss for Dense Object Detection

Focal Loss for Dense Object Detection

统计学习与视觉计算组

11+阅读 · 2018年3月15日

IJCAI | Cascade Dynamics Modeling with Attention-based RNN

IJCAI | Cascade Dynamics Modeling with Attention-based RNN

KingsGarden

13+阅读 · 2017年7月16日

From Softmax to Sparsemax-ICML16（1）

From Softmax to Sparsemax-ICML16（1）

KingsGarden

72+阅读 · 2016年11月26日

城市“建成环境——空间行为”的多尺度影响关系与机理研究

国家自然科学基金

10+阅读 · 2017年12月31日

“Fishes-in-net” 酵母孢子微胶囊式近平滑假丝酵母SCRII酶有机相高效手性合成机制研究

国家自然科学基金

2+阅读 · 2016年12月31日

Musielak-Orlicz-Sobolev 空间中的迹嵌入及其应用

国家自然科学基金

1+阅读 · 2015年12月31日

Volterra积分微分方程的多区间Chebyshev和Legendre谱配置法

国家自然科学基金

0+阅读 · 2015年12月31日

Weyl半金属TaAs/NbAs在极端条件下的输运性质研究

国家自然科学基金

0+阅读 · 2015年12月31日

Schr？dinger-Poisson方程守恒DDG方法研究

国家自然科学基金

1+阅读 · 2015年12月31日

动态Gr？bner 基与GVW算法

国家自然科学基金

0+阅读 · 2014年12月31日

“杰文斯”悖论、能效政策改进与“双控目标”分解

国家自然科学基金

0+阅读 · 2014年12月31日

Poisson流形上的修正Hamilton方法

国家自然科学基金

0+阅读 · 2014年12月31日

海量Web用户生成内容物化关键技术

国家自然科学基金

1+阅读 · 2014年12月31日

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

Arxiv

0+阅读 · 11月5日

AmbigNLG: Addressing Task Ambiguity in Instruction for NLG

Arxiv

0+阅读 · 11月4日

EMMA: End-to-End Multimodal Model for Autonomous Driving

Arxiv

0+阅读 · 11月4日

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

Arxiv

0+阅读 · 11月4日

LLM4PR: Improving Post-Ranking in Search Engine with Large Language Models

Arxiv

0+阅读 · 11月2日

CaptainCook4D: A Dataset for Understanding Errors in Procedural Activities

Arxiv

0+阅读 · 11月1日

ViT-LCA: A Neuromorphic Approach for Vision Transformers

Arxiv

0+阅读 · 10月31日

Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice

Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice

Arxiv

14+阅读 · 2021年2月9日

Spatio-Temporal Graph for Video Captioning with Knowledge Distillation

Spatio-Temporal Graph for Video Captioning with Knowledge Distillation

Arxiv

19+阅读 · 2020年3月31日

VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions

Arxiv

17+阅读 · 2018年3月20日

VIP会员

文章信息

相关主题

语言模型化

相关VIP内容

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

44+阅读 · 2021年11月24日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

30+阅读 · 2021年9月29日

Linux导论，Introduction to Linux，96页ppt

Linux导论，Introduction to Linux，96页ppt

专知会员服务

77+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

28+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

47+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

34+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

58+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

56+阅读 · 2019年10月17日

《DeepGCNs: Making GCNs Go as Deep as CNNs》

《DeepGCNs: Making GCNs Go as Deep as CNNs》

专知会员服务

30+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

151+阅读 · 2019年10月12日

热门VIP内容

相关资讯

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

26+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

27+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

17+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

42+阅读 · 2019年1月3日

meta learning 17年：MAML SNAIL

meta learning 17年：MAML SNAIL

CreateAMind

11+阅读 · 2019年1月2日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

16+阅读 · 2018年12月24日

STRCF for Visual Object Tracking

STRCF for Visual Object Tracking

统计学习与视觉计算组

14+阅读 · 2018年5月29日

Focal Loss for Dense Object Detection

Focal Loss for Dense Object Detection

统计学习与视觉计算组

11+阅读 · 2018年3月15日

IJCAI | Cascade Dynamics Modeling with Attention-based RNN

IJCAI | Cascade Dynamics Modeling with Attention-based RNN

KingsGarden

13+阅读 · 2017年7月16日

From Softmax to Sparsemax-ICML16（1）

From Softmax to Sparsemax-ICML16（1）

KingsGarden

72+阅读 · 2016年11月26日

相关论文

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

Arxiv

0+阅读 · 11月5日

AmbigNLG: Addressing Task Ambiguity in Instruction for NLG

Arxiv

0+阅读 · 11月4日

EMMA: End-to-End Multimodal Model for Autonomous Driving

Arxiv

0+阅读 · 11月4日

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

Arxiv

0+阅读 · 11月4日

LLM4PR: Improving Post-Ranking in Search Engine with Large Language Models

Arxiv

0+阅读 · 11月2日

CaptainCook4D: A Dataset for Understanding Errors in Procedural Activities

Arxiv

0+阅读 · 11月1日

ViT-LCA: A Neuromorphic Approach for Vision Transformers

Arxiv

0+阅读 · 10月31日

Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice

Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice

Arxiv

14+阅读 · 2021年2月9日

Spatio-Temporal Graph for Video Captioning with Knowledge Distillation

Spatio-Temporal Graph for Video Captioning with Knowledge Distillation

Arxiv

19+阅读 · 2020年3月31日

VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions

Arxiv

17+阅读 · 2018年3月20日

相关基金

城市“建成环境——空间行为”的多尺度影响关系与机理研究

国家自然科学基金

10+阅读 · 2017年12月31日

“Fishes-in-net” 酵母孢子微胶囊式近平滑假丝酵母SCRII酶有机相高效手性合成机制研究

国家自然科学基金

2+阅读 · 2016年12月31日

Musielak-Orlicz-Sobolev 空间中的迹嵌入及其应用

国家自然科学基金

1+阅读 · 2015年12月31日

Volterra积分微分方程的多区间Chebyshev和Legendre谱配置法

国家自然科学基金

0+阅读 · 2015年12月31日

Weyl半金属TaAs/NbAs在极端条件下的输运性质研究

国家自然科学基金

0+阅读 · 2015年12月31日

Schr？dinger-Poisson方程守恒DDG方法研究

国家自然科学基金

1+阅读 · 2015年12月31日

动态Gr？bner 基与GVW算法

国家自然科学基金

0+阅读 · 2014年12月31日

“杰文斯”悖论、能效政策改进与“双控目标”分解

国家自然科学基金

0+阅读 · 2014年12月31日

Poisson流形上的修正Hamilton方法

国家自然科学基金

0+阅读 · 2014年12月31日

海量Web用户生成内容物化关键技术

国家自然科学基金

1+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员