
The Text Visual Question Answering (TextVQA) task was proposed to address the inability of general visual question answering (VQA) methods to handle text in images. To answer questions about text in an image, TextVQA must jointly reason over multiple modalities, including the visual scene and the scene text, as well as their relationships, which makes the task highly challenging. Mainstream methods introduce an external optical character recognition (OCR) module as pre-processing and combine its output with a VQA framework to predict the answer, so TextVQA performance is heavily affected by OCR accuracy. This manifests as two error-accumulation and propagation phenomena: 1) OCR errors corrupt the direct semantic encoding of the scene text, biasing the multimodal interaction and reasoning process so that the correct answer cannot be located. 2) Even when reasoning and answer localization are correct, OCR errors still corrupt the final answer that is "copied" from the OCR results. In addition, a semantic gap exists when the visual-object modality interacts with the scene-text and question modalities, preventing effective multimodal fusion.
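To make the second failure mode concrete, here is a minimal PyTorch-style sketch of the pointer/copy decoding commonly used in OCR+VQA pipelines; all names (`decode_step`, `vocab_scores`, `copy_scores`, `ocr_tokens`) are hypothetical illustrations, not from the paper. At each step the decoder either generates a word from a fixed vocabulary or copies an OCR token, so a misrecognized token is copied verbatim into the answer even when the pointer itself is correct.

```python
import torch

def decode_step(vocab_scores, copy_scores, vocab, ocr_tokens):
    """One answer-decoding step of a pointer/copy decoder.

    vocab_scores: (V,) scores over a fixed answer vocabulary.
    copy_scores:  (N,) scores over the N OCR tokens in the image.
    Returns the selected answer word.
    """
    scores = torch.cat([vocab_scores, copy_scores])  # joint scoring space
    idx = int(scores.argmax())
    if idx < len(vocab):
        return vocab[idx]              # generate from the vocabulary
    # Copy branch: the answer is taken verbatim from the OCR output,
    # so an OCR misrecognition propagates directly into the final
    # answer even if the model points at the right token.
    return ocr_tokens[idx - len(vocab)]

# Toy example: the model points at the correct OCR token, but the
# token itself was misrecognized, so the predicted answer is wrong.
vocab = ["yes", "no", "stop"]
ocr_tokens = ["coca-coia"]             # OCR error for "coca-cola"
vocab_scores = torch.tensor([0.1, 0.2, 0.3])
copy_scores = torch.tensor([0.9])
print(decode_step(vocab_scores, copy_scores, vocab, ocr_tokens))  # coca-coia
```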

This article briefly introduces the ACM MM 2021 paper "Beyond OCR + VQA: Involving OCR into the Flow for Robust and Accurate TextVQA", a collaboration between Communication University of China and the Institute of Information Engineering, Chinese Academy of Sciences. The paper proposes BOV, a TextVQA method that is robust to text-recognition errors: it involves optical character recognition (OCR) in the forward pipeline of TextVQA by exploiting multimodal cues from both the text-detection and text-recognition stages, so that reasonable semantic representations of the scene text can be obtained even when the text is not accurately recognized, and it adaptively corrects the decoded answer using the rich contextual information available in the TextVQA task.
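A minimal sketch of the fusion idea follows, assuming hypothetical names (`det_feat` for detection-stage visual features of each OCR token, `rec_feat` for recognition-stage features such as character probabilities). This illustrates combining the two cue sources so that the token embedding does not depend solely on the decoded string; it is not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class TextTokenEncoder(nn.Module):
    """Illustrative encoder that fuses cues from both OCR stages.

    det_dim: dim of detection-stage visual features (token appearance).
    rec_dim: dim of recognition-stage features (e.g., character logits),
             which carry useful signal even when the final string is wrong.
    """
    def __init__(self, det_dim=2048, rec_dim=256, hidden=768):
        super().__init__()
        self.det_proj = nn.Linear(det_dim, hidden)
        self.rec_proj = nn.Linear(rec_dim, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, det_feat, rec_feat):
        # Sum-fuse the projected detection and recognition cues so a
        # misrecognized string alone cannot corrupt the representation.
        return self.norm(self.det_proj(det_feat) + self.rec_proj(rec_feat))

# Usage: 5 OCR tokens, each with visual and recognition-stage features.
enc = TextTokenEncoder()
tokens = enc(torch.randn(5, 2048), torch.randn(5, 256))
print(tokens.shape)  # torch.Size([5, 768])
```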

Latest Papers

This work aims to reproduce results from the CVPR 2020 paper by Gidaris et al. Self-supervised learning (SSL) is used to learn feature representations of an image from an unlabeled dataset. The paper proposes to use bag-of-words (BoW) deep feature descriptors as a self-supervised learning target to learn robust deep representations. BowNet is trained to reconstruct the histogram of visual words (i.e., the deep BoW descriptor) of a reference image when presented with a perturbed version of the image as input. This method thus aims to learn perturbation-invariant and context-aware image features that can be useful for few-shot tasks or supervised downstream tasks. In the paper, the authors describe BowNet as a network consisting of a convolutional feature extractor $\Phi(\cdot)$ and a dense-softmax layer $\Omega(\cdot)$ trained to predict BoW features from images. After BoW training, the features of $\Phi$ are used in downstream tasks. For this challenge, we tried to build and train a network that could reproduce the CIFAR-100 accuracy improvements reported in the original paper; however, we were unable to reproduce an accuracy improvement comparable to what the authors reported.
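As a concrete illustration of the training objective described above, here is a minimal PyTorch sketch. The backbone, codebook construction, and BoW assignment are simplified stand-ins (the paper uses a real convolutional backbone and spatially dense visual-word assignment); all shapes and helper names here are assumptions for illustration. The model predicts, from a perturbed image, the soft histogram of visual words computed on the unperturbed reference image, trained with a cross-entropy-style divergence between the two distributions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 2048  # visual-word vocabulary size (assumed)

class BowNet(nn.Module):
    """Phi: conv feature extractor; Omega: dense-softmax BoW predictor."""
    def __init__(self, feat_dim=512, k=K):
        super().__init__()
        self.phi = nn.Sequential(            # stand-in for a real backbone
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.omega = nn.Linear(feat_dim, k)  # dense layer + softmax head

    def forward(self, x):
        return F.log_softmax(self.omega(self.phi(x)), dim=-1)

def bow_target(ref_img, codebook):
    """Soft BoW histogram of the reference image over a fixed codebook.
    Simplification: assign the pooled feature to visual words via
    softmaxed negative distances (the paper assigns densely per
    spatial location)."""
    with torch.no_grad():
        feat = model.phi(ref_img)              # (B, D) pooled features
        dists = torch.cdist(feat, codebook)    # (B, K) distances to words
        return F.softmax(-dists, dim=-1)       # (B, K) soft histogram

model = BowNet()
codebook = torch.randn(K, 512)                 # frozen visual-word centers
ref = torch.randn(4, 3, 32, 32)                # CIFAR-sized batch
perturbed = ref + 0.1 * torch.randn_like(ref)  # stand-in perturbation

log_pred = model(perturbed)                    # predict BoW of the reference
target = bow_target(ref, codebook)
loss = F.kl_div(log_pred, target, reduction="batchmean")  # CE up to a const.
loss.backward()
print(float(loss))
```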
