Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model cross-modal alignment via the similarity of global image and text representations, or via cross-modal attention over image and text features. However, they fail to explicitly learn the fine-grained semantic alignment between visual regions and textual phrases, as only global image-text alignment information is available. In this paper, we introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions. To efficiently compute the game-theoretic interactions, we further propose an uncertainty-aware neural Shapley interaction learning module. Experiments show that LOUPE achieves state-of-the-art performance on a variety of vision-language tasks. Furthermore, without any object-level human annotations or fine-tuning, LOUPE achieves competitive performance on object detection and visual grounding. More importantly, LOUPE opens a new promising direction of learning fine-grained semantics from large-scale raw image-text pairs. The repository of this work is at https://github.com/YYJMJC/LOUPE.
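For background on the game-theoretic quantity the abstract refers to, the Shapley interaction index between two players i and j measures the extra value they create by acting jointly rather than separately, averaged over all coalitions of the remaining players. The sketch below computes it exactly for a toy cooperative game; the function names and the toy value function are illustrative assumptions, not the paper's uncertainty-aware neural module (which approximates this quantity, since exact computation is exponential in the number of players).

```python
from itertools import combinations
from math import factorial

def shapley_interaction(players, i, j, v):
    """Exact Shapley interaction index I(i, j) for a value function v
    defined on coalitions (frozensets). Exponential cost: only feasible
    for small player sets, which motivates learned approximations."""
    others = [p for p in players if p not in (i, j)]
    n = len(players)
    total = 0.0
    for k in range(len(others) + 1):
        # Weight of coalitions of size k: |S|! (n - |S| - 2)! / (n - 1)!
        weight = factorial(k) * factorial(n - k - 2) / factorial(n - 1)
        for S in combinations(others, k):
            S = frozenset(S)
            # Marginal benefit of i and j acting jointly vs. separately
            delta = v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
            total += weight * delta
    return total

# Toy game: players 0 and 1 are strongly synergistic, others additive.
def toy_value(S):
    return 10.0 if {0, 1} <= S else float(len(S))

print(shapley_interaction([0, 1, 2], 0, 1, toy_value))  # → 7.5
```

A strongly positive index (as here) indicates that the two players — e.g. an image region and a textual phrase in LOUPE's formulation — contribute far more to the model's prediction together than apart, which is the signal used to identify fine-grained alignments.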