Visual recognition has been dominated by convolutional neural networks (CNNs) for years. Though recently the prevailing vision transformers (ViTs) have shown the great potential of self-attention-based models in ImageNet classification, their performance is still inferior to the latest SOTA CNNs if no extra data are provided. In this work, we aim to close the performance gap and demonstrate that attention-based models are indeed able to outperform CNNs. We found that the main factor limiting the performance of ViTs for ImageNet classification is their low efficacy in encoding fine-level features into the token representations. To resolve this, we introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO). Unlike self-attention, which focuses on global dependency modeling at a coarse level, the outlook attention aims to efficiently encode finer-level features and contexts into tokens, which are shown to be critical for recognition performance but largely ignored by self-attention. Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, being the first model exceeding 87% accuracy on this competitive benchmark without using any extra training data. In addition, the pre-trained VOLO transfers well to downstream tasks, such as semantic segmentation. We achieve 84.3% mIoU score on the Cityscapes validation set and 54.3% on the ADE20K validation set. Code is available at https://github.com/sail-sg/volo.
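The abstract describes outlook attention only conceptually, so the following is a minimal PyTorch sketch of the idea as we read it: each token directly predicts, via a linear layer, the attention weights over its local K x K window (no query-key dot products), and the weighted windows are overlap-added back onto the token map. The class name, the stride-1 simplification, and the omission of dropout and bias terms are our own assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    """Sketch of outlook attention (stride 1, no dropout) -- an assumption-laden
    simplification, not the authors' released code."""

    def __init__(self, dim, num_heads=2, kernel_size=3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.kernel_size = kernel_size
        self.scale = self.head_dim ** -0.5
        self.v = nn.Linear(dim, dim, bias=False)
        # Each token predicts a (K*K) x (K*K) attention map per head.
        self.attn = nn.Linear(dim, num_heads * kernel_size ** 4)
        self.proj = nn.Linear(dim, dim)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):  # x: (B, H, W, C) token map
        B, H, W, C = x.shape
        K, N = self.kernel_size, H * W
        # Gather each location's K x K neighborhood of value vectors.
        v = self.v(x).permute(0, 3, 1, 2)                    # (B, C, H, W)
        v = self.unfold(v).reshape(
            B, self.num_heads, self.head_dim, K * K, N
        ).permute(0, 1, 4, 3, 2)                             # (B, h, N, K*K, d)
        # Attention weights come from a linear projection, not Q.K products.
        a = self.attn(x).reshape(B, N, self.num_heads, K * K, K * K)
        a = a.permute(0, 2, 1, 3, 4) * self.scale            # (B, h, N, K*K, K*K)
        a = a.softmax(dim=-1)
        # Weighted aggregation inside each local window.
        out = (a @ v).permute(0, 1, 4, 3, 2).reshape(B, C * K * K, N)
        # Overlap-add the windows back onto the H x W token map.
        out = F.fold(out, (H, W), K, padding=K // 2)
        return self.proj(out.permute(0, 2, 3, 1))            # (B, H, W, C)
```

Because the weights are generated directly from each token rather than from query-key dot products, the cost per window stays low, which is what makes it feasible to run this attention densely on fine-grained token maps; a call such as `OutlookAttention(384, num_heads=2)(torch.randn(1, 28, 28, 384))` returns a tensor of the same shape as its input.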