We introduce the task of 3D visual grounding in large-scale dynamic scenes based on natural-language descriptions and online-captured multi-modal visual data, including 2D images and 3D LiDAR point clouds. We present a novel method, WildRefer, for this task that fully exploits the appearance features in images, the location and geometry features in point clouds, and the dynamic features across consecutive input frames to match the semantic features in language. In addition, we propose two novel datasets, STRefer and LifeRefer, which focus on large-scale, human-centric daily-life scenarios with abundant 3D object and natural-language annotations. Our datasets are significant for research on 3D visual grounding in the wild and have great potential to advance the development of autonomous driving and service robots. Extensive comparisons and ablation studies show that our method achieves state-of-the-art performance on both proposed datasets. The code and datasets will be released upon publication.
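To make the cross-modal matching concrete, the sketch below illustrates one plausible way to fuse per-candidate appearance, geometry, and motion features and score them against a language embedding with cross-attention. All module names, dimensions, and the fusion scheme here are illustrative assumptions, not the actual WildRefer architecture.

```python
import torch
import torch.nn as nn

class CrossModalGroundingSketch(nn.Module):
    """Illustrative sketch: fuse per-object appearance (2D image), geometry
    (3D point cloud), and motion (consecutive-frame) features, then score each
    object candidate against token-level language features. All dimensions and
    module names are assumptions, not the published WildRefer design."""
    def __init__(self, d_img=256, d_pts=256, d_dyn=256, d_lang=256, d_model=256):
        super().__init__()
        self.proj_img = nn.Linear(d_img, d_model)
        self.proj_pts = nn.Linear(d_pts, d_model)
        self.proj_dyn = nn.Linear(d_dyn, d_model)
        self.proj_lang = nn.Linear(d_lang, d_model)
        # simple cross-attention: object candidates attend to language tokens
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.score_head = nn.Linear(d_model, 1)

    def forward(self, f_img, f_pts, f_dyn, f_lang):
        # f_img, f_pts, f_dyn: (B, N_obj, d_*) per-candidate features
        # f_lang: (B, N_tok, d_lang) token-level language features
        obj = self.proj_img(f_img) + self.proj_pts(f_pts) + self.proj_dyn(f_dyn)
        lang = self.proj_lang(f_lang)
        fused, _ = self.cross_attn(query=obj, key=lang, value=lang)
        return self.score_head(fused).squeeze(-1)  # (B, N_obj) grounding scores


if __name__ == "__main__":
    B, N_obj, N_tok = 2, 8, 12
    model = CrossModalGroundingSketch()
    scores = model(torch.randn(B, N_obj, 256), torch.randn(B, N_obj, 256),
                   torch.randn(B, N_obj, 256), torch.randn(B, N_tok, 256))
    print(scores.shape)  # torch.Size([2, 8])
```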