TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval


September 28, 2022


Xiaohan Zou,Changqiao Wu,Lele Cheng,Zhongyuan Wang

Most existing methods in vision-language retrieval match the two modalities in one of three ways: (1) comparing their global feature vectors, which loses information and lacks interpretability; (2) detecting objects in images or videos and aligning the text with fine-grained features, which relies on complicated model designs; or (3) modeling fine-grained interaction via cross-attention over visual and textual tokens, which is inefficient. To address these limitations, some recent works simply aggregate token-wise similarities to achieve fine-grained alignment, but they lack intuitive explanations and neglect the relationships between token-level features and global representations with high-level semantics. In this work, we rethink fine-grained cross-modal alignment and devise a new model-agnostic formulation for it. We additionally demystify recent popular works and subsume them into our scheme. Furthermore, inspired by optimal transport theory, we introduce TokenFlow, an instantiation of the proposed scheme. By modifying only the similarity function, our method achieves performance comparable to SoTA algorithms with heavy model designs on major video-text retrieval benchmarks. The visualization further indicates that TokenFlow successfully leverages fine-grained information and achieves better interpretability.
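The abstract contrasts three kinds of similarity function: one over global feature vectors, one that aggregates token-wise similarities, and one inspired by optimal transport. The NumPy sketch below illustrates each idea in its simplest generic form. It is not the TokenFlow formulation, which the abstract does not spell out. The function names, the mean-pooling step, the max-mean aggregation rule, the uniform token weights, and the `reg` and `n_iters` parameters are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def global_similarity(visual_tokens, text_tokens):
    """Cosine similarity of mean-pooled global vectors (assumed pooling)."""
    v = l2_normalize(visual_tokens.mean(axis=0))
    t = l2_normalize(text_tokens.mean(axis=0))
    return float(v @ t)


def token_wise_similarity(visual_tokens, text_tokens):
    """Aggregate token-wise cosine similarities with a max-mean rule (assumed)."""
    sim = l2_normalize(visual_tokens) @ l2_normalize(text_tokens).T  # (Nv, Nt)
    text_to_visual = sim.max(axis=0).mean()  # best visual match per text token
    visual_to_text = sim.max(axis=1).mean()  # best text match per visual token
    return float(0.5 * (text_to_visual + visual_to_text))


def sinkhorn_similarity(visual_tokens, text_tokens, reg=0.1, n_iters=50):
    """Optimal-transport-style similarity via entropic Sinkhorn iterations.

    Assumes uniform mass on every token and cost = 1 - cosine similarity.
    Returns the plan-weighted similarity and the transport plan itself.
    """
    sim = l2_normalize(visual_tokens) @ l2_normalize(text_tokens).T
    n_v, n_t = sim.shape
    a = np.full(n_v, 1.0 / n_v)  # mass on each visual token
    b = np.full(n_t, 1.0 / n_t)  # mass on each text token
    kernel = np.exp(-(1.0 - sim) / reg)
    u = np.ones(n_v)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)  # rescale to match the text-side marginals
        u = a / (kernel @ v)    # rescale to match the visual-side marginals
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * sim).sum()), plan


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(5, 8))    # 5 hypothetical visual tokens
    caption = rng.normal(size=(3, 8))  # 3 hypothetical text tokens
    print("global:", global_similarity(video, caption))
    print("token-wise:", token_wise_similarity(video, caption))
    score, plan = sinkhorn_similarity(video, caption)
    print("transport-based:", score)
```

In all three functions only the similarity computation changes, while the token features stay the same. The transport plan returned by `sinkhorn_similarity` can also be inspected to see which visual tokens are matched to which text tokens, which is one way a token-level similarity can be made easier to interpret.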

