Although text recognition has evolved significantly over the years, state-of-the-art (SOTA) models still struggle in in-the-wild scenarios due to complex backgrounds, varying fonts, uncontrolled illumination, distortions, and other artefacts. This is because such models rely solely on visual information for text recognition and thus lack semantic reasoning capabilities. In this paper, we argue that semantic information plays a complementary role to purely visual cues. More specifically, we exploit semantic information by proposing a multi-stage, multi-scale attentional decoder that performs joint visual-semantic reasoning. Our novelty lies in the intuition that, for text recognition, predictions should be refined in a stage-wise manner. Accordingly, our key contribution is the design of a stage-wise unrolling attentional decoder in which the non-differentiability introduced by discretely predicted character labels must be bypassed for end-to-end training. While the first stage predicts using visual features alone, subsequent stages refine these predictions using joint visual-semantic information. Additionally, we introduce multi-scale 2D attention, along with dense and residual connections between stages, to handle varying character sizes, yielding better performance and faster convergence during training. Experimental results show that our approach outperforms existing SOTA methods by a considerable margin.
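The stage-wise refinement idea can be illustrated with a deliberately minimal numpy sketch. This is not the paper's architecture: all shapes, weight matrices, and names (`W_vis`, `W_sem`, `E`) are hypothetical, and the attention machinery is omitted. It only shows the core trick the abstract alludes to: instead of feeding the non-differentiable argmax character labels back into later stages, each refinement stage consumes the *expected* character embedding under the previous stage's softmax distribution, which keeps the whole unrolled decoder differentiable end-to-end.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy stage-wise decoder (hypothetical shapes/names, not the paper's model).
rng = np.random.default_rng(0)
T, D, V = 5, 8, 10                     # sequence length, feature dim, vocab size
visual = rng.normal(size=(T, D))       # per-position visual features
W_vis = rng.normal(size=(D, V)) * 0.1  # visual-only classifier
E = rng.normal(size=(V, D)) * 0.1      # character embedding table
W_sem = rng.normal(size=(D, V)) * 0.1  # joint visual-semantic classifier

logits = visual @ W_vis                # stage 1: prediction from visual features alone
for stage in range(2):                 # later stages: joint visual-semantic refinement
    p = softmax(logits)                # (T, V) soft character distribution
    soft_emb = p @ E                   # expected embedding: differentiable, no argmax
    # residual connection: refine the previous stage's logits rather than replace them
    logits = logits + (visual + soft_emb) @ W_sem

pred = logits.argmax(-1)               # discrete readout, used only at inference time
```

In a trained model the soft embedding lets gradients from later stages flow back through earlier predictions; a hard argmax at each stage would block them.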