联合视觉定位和跟踪与自然语言说明 (Joint Visual Grounding and Tracking with Natural Language Specification) - 专知论文

会员服务 ·

0

分离的 · MoDELS · Extensibility · Guidance · INFORMS ·

2023 年 3 月 21 日

Joint Visual Grounding and Tracking with Natural Language Specification

翻译：联合视觉定位和跟踪与自然语言说明

Li Zhou,Zikun Zhou,Kaige Mao,Zhenyu He

from arxiv, Accepted by CVPR 2023

Tracking by natural language specification aims to locate the referred target in a sequence based on the natural language description. Existing algorithms solve this issue in two steps, visual grounding and tracking, and accordingly deploy the separated grounding model and tracking model to implement these two steps, respectively. Such a separated framework overlooks the link between visual grounding and tracking, which is that the natural language descriptions provide global semantic cues for localizing the target for both two steps. Besides, the separated framework can hardly be trained end-to-end. To handle these issues, we propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task: localizing the referred target based on the given visual-language references. Specifically, we propose a multi-source relation modeling module to effectively build the relation between the visual-language references and the test image. In addition, we design a temporal modeling module to provide a temporal clue with the guidance of the global semantic information for our model, which effectively improves the adaptability to the appearance variations of the target. Extensive experimental results on TNL2K, LaSOT, OTB99, and RefCOCOg demonstrate that our method performs favorably against state-of-the-art algorithms for both tracking and grounding. Code is available at https://github.com/lizhou-cs/JointNLT.

翻译：跟踪自然语言说明旨在根据自然语言描述在序列中定位所指目标。现有算法通过两个步骤解决此问题，即视觉定位和跟踪，并相应地部署分离的定位模型和跟踪模型来执行这两个步骤。这种分离的框架忽略了视觉定位和跟踪之间的联系，即自然语言说明为两个步骤都提供了全局语义线索以定位目标。此外，这种分离的框架很难进行端到端的训练。为了解决这些问题，我们提出了一个联合视觉定位和跟踪框架，该框架将定位和跟踪重新定义为一个统一的任务：根据给定的视觉语言参考定位所指目标。具体地，我们提出了一个多源关系建模模块，有效地建立了视觉语言参考和测试图像之间的关系。此外，我们设计了一个时间建模模块，以全局语义信息的指导提供时间线索，从而有效地提高了我们的模型对目标外观变化的适应性。在 TNL2K，LaSOT，OTB99 和 RefCOCOg 上的广泛实验结果表明，我们的方法在跟踪和定位方面均表现出色。代码可在 https://github.com/lizhou-cs/JointNLT 上获取。

0

相关内容

分离的

【视觉和语言导航:任务、方法和未来方向的综述】Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

【视觉和语言导航:任务、方法和未来方向的综述】Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

专知会员服务

36+阅读 · 2022年3月25日

【斯坦福Kevin Chen博士论文】视觉、语言和具身AI的多模态表示， Multimodal representations for vision, language, and embodied AI

【斯坦福Kevin Chen博士论文】视觉、语言和具身AI的多模态表示， Multimodal representations for vision, language, and embodied AI

专知会员服务

64+阅读 · 2022年3月6日

【香港中文大学】基于Aspect的情感分析综述论文，A Survey on Aspect-Based Sentiment Analysis: Tasks, Methods, and Challenges

【香港中文大学】基于Aspect的情感分析综述论文，A Survey on Aspect-Based Sentiment Analysis: Tasks, Methods, and Challenges

专知会员服务

20+阅读 · 2022年3月3日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

【CVPR2020】视觉跟踪的概率回归，Probabilistic Regression for Visual Tracking

【CVPR2020】视觉跟踪的概率回归，Probabilistic Regression for Visual Tracking

专知会员服务

37+阅读 · 2020年3月27日

Transformer文本分类代码

Transformer文本分类代码

专知会员服务

118+阅读 · 2020年2月3日

【AAAI2020论文-清华大学】Enhanced Meta-Learning for Cross-lingual Named Entity Recognition with Minimal Resources，最小资源增强的元学习跨语言命名实体识别

【AAAI2020论文-清华大学】Enhanced Meta-Learning for Cross-lingual Named Entity Recognition with Minimal Resources，最小资源增强的元学习跨语言命名实体识别

专知会员服务

31+阅读 · 2019年11月17日

【深度学习视频分析/多模态学习资源大列表】

【深度学习视频分析/多模态学习资源大列表】

专知会员服务

92+阅读 · 2019年10月16日

计算机视觉最佳实践、代码示例和相关文档

计算机视觉最佳实践、代码示例和相关文档

专知会员服务

20+阅读 · 2019年10月9日

【CVPR 2019 | tutorial】视觉识别Visual Recognition and Beyond，Facebook|Ross Girshick，Justin Johnson（李飞飞高徒）

【CVPR 2019 | tutorial】视觉识别Visual Recognition and Beyond，Facebook|Ross Girshick，Justin Johnson（李飞飞高徒）

专知会员服务

29+阅读 · 2019年6月16日

GNN 新基准！Long Range Graph Benchmark

GNN 新基准！Long Range Graph Benchmark

图与推荐

0+阅读 · 2022年10月18日

ECCV 2022 | ByteTrack: 简单高效的数据关联方法

ECCV 2022 | ByteTrack: 简单高效的数据关联方法

PaperWeekly

0+阅读 · 2022年8月1日

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

【泡泡汇总】CVPR2019 SLAM Paperlist

【泡泡汇总】CVPR2019 SLAM Paperlist

泡泡机器人SLAM

14+阅读 · 2019年6月12日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

【跟踪Tracking】15篇论文+代码 | 中秋快乐~

【跟踪Tracking】15篇论文+代码 | 中秋快乐~

专知

18+阅读 · 2018年9月24日

【论文推荐】最新十篇目标跟踪相关论文—多帧光流跟踪、动态图学习、MV-YOLO、姿态估计、深度核相关滤波、Benchmark

【论文推荐】最新十篇目标跟踪相关论文—多帧光流跟踪、动态图学习、MV-YOLO、姿态估计、深度核相关滤波、Benchmark

专知

13+阅读 · 2018年5月26日

【论文推荐】最新5篇目标跟踪（Object Tracking）相关论文—并行跟踪和验证、光流、自动跟踪、相关滤波集成、CFNet

【论文推荐】最新5篇目标跟踪（Object Tracking）相关论文—并行跟踪和验证、光流、自动跟踪、相关滤波集成、CFNet

专知

25+阅读 · 2018年2月6日

跟踪器融合的视觉跟踪方法研究

国家自然科学基金

1+阅读 · 2013年12月31日

视频的中层视觉表达和高层行为识别研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于视觉注意和稀疏表示的行人检测与跟踪方法研究

国家自然科学基金

3+阅读 · 2013年12月31日

视觉原理指导下的动目标检测与跟踪新方法研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于深度信息面向主动视觉任务的视觉目标遮挡检测与规避方法研究

国家自然科学基金

1+阅读 · 2012年12月31日

面向OTHR目标跟踪的多路径PHD滤波算法研究

国家自然科学基金

1+阅读 · 2012年12月31日

基于曲面柔韧度的三维形状局部特征描述符研究

国家自然科学基金

0+阅读 · 2012年12月31日

新型抗生素Bagremycins生物合成基因簇的鉴定与解析

国家自然科学基金

0+阅读 · 2012年12月31日

Fucí意义下的跨共振的Sturm-Liouville问题

国家自然科学基金

0+阅读 · 2012年12月31日

siRNA基因沉默与诱导双向基因治疗关节炎的软骨、滑膜生物学响应及ex vivo系统转基因在体示踪研究

国家自然科学基金

0+阅读 · 2011年12月31日

Language modeling via stochastic processes

Arxiv

0+阅读 · 2023年5月11日

Streaming Joint Speech Recognition and Disfluency Detection

Arxiv

0+阅读 · 2023年5月11日

Dialogue Planning via Brownian Bridge Stochastic Process for Goal-directed Proactive Dialogue

Arxiv

0+阅读 · 2023年5月9日

Multimodal Prompting with Missing Modalities for Visual Recognition

Arxiv

11+阅读 · 2023年3月6日

A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective

Arxiv

21+阅读 · 2022年9月27日

The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions

Arxiv

14+阅读 · 2022年1月20日

Cross-Modal Object Tracking: Modality-Aware Representations and A Unified Benchmark

Arxiv

14+阅读 · 2021年11月11日

Deep Learning for UAV-based Object Detection and Tracking: A Survey

Arxiv

61+阅读 · 2021年10月25日

Look-into-Object: Self-supervised Structure Modeling for Object Recognition

Look-into-Object: Self-supervised Structure Modeling for Object Recognition

Arxiv

15+阅读 · 2020年3月31日

Commonsense Reasoning for Natural Language Understanding: A Survey of Benchmarks, Resources, and Approaches

Arxiv

16+阅读 · 2019年4月2日

VIP会员

文章信息

相关主题

相关VIP内容

【视觉和语言导航:任务、方法和未来方向的综述】Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

【视觉和语言导航:任务、方法和未来方向的综述】Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

专知会员服务

36+阅读 · 2022年3月25日

【斯坦福Kevin Chen博士论文】视觉、语言和具身AI的多模态表示， Multimodal representations for vision, language, and embodied AI

【斯坦福Kevin Chen博士论文】视觉、语言和具身AI的多模态表示， Multimodal representations for vision, language, and embodied AI

专知会员服务

64+阅读 · 2022年3月6日

【香港中文大学】基于Aspect的情感分析综述论文，A Survey on Aspect-Based Sentiment Analysis: Tasks, Methods, and Challenges

【香港中文大学】基于Aspect的情感分析综述论文，A Survey on Aspect-Based Sentiment Analysis: Tasks, Methods, and Challenges

专知会员服务

20+阅读 · 2022年3月3日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

【CVPR2020】视觉跟踪的概率回归，Probabilistic Regression for Visual Tracking

【CVPR2020】视觉跟踪的概率回归，Probabilistic Regression for Visual Tracking

专知会员服务

37+阅读 · 2020年3月27日

Transformer文本分类代码

Transformer文本分类代码

专知会员服务

118+阅读 · 2020年2月3日

【AAAI2020论文-清华大学】Enhanced Meta-Learning for Cross-lingual Named Entity Recognition with Minimal Resources，最小资源增强的元学习跨语言命名实体识别

【AAAI2020论文-清华大学】Enhanced Meta-Learning for Cross-lingual Named Entity Recognition with Minimal Resources，最小资源增强的元学习跨语言命名实体识别

专知会员服务

31+阅读 · 2019年11月17日

【深度学习视频分析/多模态学习资源大列表】

【深度学习视频分析/多模态学习资源大列表】

专知会员服务

92+阅读 · 2019年10月16日

计算机视觉最佳实践、代码示例和相关文档

计算机视觉最佳实践、代码示例和相关文档

专知会员服务

20+阅读 · 2019年10月9日

【CVPR 2019 | tutorial】视觉识别Visual Recognition and Beyond，Facebook|Ross Girshick，Justin Johnson（李飞飞高徒）

【CVPR 2019 | tutorial】视觉识别Visual Recognition and Beyond，Facebook|Ross Girshick，Justin Johnson（李飞飞高徒）

专知会员服务

29+阅读 · 2019年6月16日

热门VIP内容

开通专知VIP会员享更多权益服务

《电磁（电子）战：英国能力》最新32页报告

《美军条令：斯特赖克步兵步枪排与班作战条令》最新450页

《美海军分布式海上作战（DMO）概念：最新情况》

《跨时空与跨模态学习事件模式构建体系（LESTAT）》57页DARPA研究报告

相关资讯

GNN 新基准！Long Range Graph Benchmark

GNN 新基准！Long Range Graph Benchmark

图与推荐

0+阅读 · 2022年10月18日

ECCV 2022 | ByteTrack: 简单高效的数据关联方法

ECCV 2022 | ByteTrack: 简单高效的数据关联方法

PaperWeekly

0+阅读 · 2022年8月1日

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

【泡泡汇总】CVPR2019 SLAM Paperlist

【泡泡汇总】CVPR2019 SLAM Paperlist

泡泡机器人SLAM

14+阅读 · 2019年6月12日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

【跟踪Tracking】15篇论文+代码 | 中秋快乐~

【跟踪Tracking】15篇论文+代码 | 中秋快乐~

专知

18+阅读 · 2018年9月24日

【论文推荐】最新十篇目标跟踪相关论文—多帧光流跟踪、动态图学习、MV-YOLO、姿态估计、深度核相关滤波、Benchmark

【论文推荐】最新十篇目标跟踪相关论文—多帧光流跟踪、动态图学习、MV-YOLO、姿态估计、深度核相关滤波、Benchmark

专知

13+阅读 · 2018年5月26日

【论文推荐】最新5篇目标跟踪（Object Tracking）相关论文—并行跟踪和验证、光流、自动跟踪、相关滤波集成、CFNet

【论文推荐】最新5篇目标跟踪（Object Tracking）相关论文—并行跟踪和验证、光流、自动跟踪、相关滤波集成、CFNet

专知

25+阅读 · 2018年2月6日

相关论文

Language modeling via stochastic processes

Arxiv

0+阅读 · 2023年5月11日

Streaming Joint Speech Recognition and Disfluency Detection

Arxiv

0+阅读 · 2023年5月11日

Dialogue Planning via Brownian Bridge Stochastic Process for Goal-directed Proactive Dialogue

Arxiv

0+阅读 · 2023年5月9日

Multimodal Prompting with Missing Modalities for Visual Recognition

Arxiv

11+阅读 · 2023年3月6日

A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective

Arxiv

21+阅读 · 2022年9月27日

The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions

Arxiv

14+阅读 · 2022年1月20日

Cross-Modal Object Tracking: Modality-Aware Representations and A Unified Benchmark

Arxiv

14+阅读 · 2021年11月11日

Deep Learning for UAV-based Object Detection and Tracking: A Survey

Arxiv

61+阅读 · 2021年10月25日

Look-into-Object: Self-supervised Structure Modeling for Object Recognition

Look-into-Object: Self-supervised Structure Modeling for Object Recognition

Arxiv

15+阅读 · 2020年3月31日

Commonsense Reasoning for Natural Language Understanding: A Survey of Benchmarks, Resources, and Approaches

Arxiv

16+阅读 · 2019年4月2日

相关基金

跟踪器融合的视觉跟踪方法研究

国家自然科学基金

1+阅读 · 2013年12月31日

视频的中层视觉表达和高层行为识别研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于视觉注意和稀疏表示的行人检测与跟踪方法研究

国家自然科学基金

3+阅读 · 2013年12月31日

视觉原理指导下的动目标检测与跟踪新方法研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于深度信息面向主动视觉任务的视觉目标遮挡检测与规避方法研究

国家自然科学基金

1+阅读 · 2012年12月31日

面向OTHR目标跟踪的多路径PHD滤波算法研究

国家自然科学基金

1+阅读 · 2012年12月31日

基于曲面柔韧度的三维形状局部特征描述符研究

国家自然科学基金

0+阅读 · 2012年12月31日

新型抗生素Bagremycins生物合成基因簇的鉴定与解析

国家自然科学基金

0+阅读 · 2012年12月31日

Fucí意义下的跨共振的Sturm-Liouville问题

国家自然科学基金

0+阅读 · 2012年12月31日

siRNA基因沉默与诱导双向基因治疗关节炎的软骨、滑膜生物学响应及ex vivo系统转基因在体示踪研究

国家自然科学基金

0+阅读 · 2011年12月31日

微信扫码咨询专知VIP会员