When describing an image, reading the text in the visual scene is crucial for understanding the key information. Recent work explores the TextCaps task, \emph{i.e.}, image captioning with reading Optical Character Recognition (OCR) tokens, which requires models to read text in images and incorporate it into the generated captions. Existing approaches fail to generate accurate descriptions because of (1) poor reading ability, (2) an inability to choose the crucial words among all the extracted OCR tokens, and (3) repetition of words in predicted captions. To this end, we propose Confidence-aware Non-repetitive Multimodal Transformers (CNMT) to tackle the above challenges. Our CNMT consists of a reading module, a reasoning module, and a generation module. The Reading Module employs better OCR systems to enhance text-reading ability and a confidence embedding to select the most noteworthy tokens. To address word redundancy in captions, our Generation Module includes a repetition mask that prevents repeated words from being predicted. Our model outperforms state-of-the-art models on the TextCaps dataset, improving CIDEr from 81.0 to 93.0. Our source code is publicly available.
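To make the repetition-mask idea concrete, the sketch below shows one minimal way such a mask could be realized during autoregressive decoding: the logits of tokens already emitted in the caption are suppressed so they cannot be predicted again. This is an illustrative assumption, not the authors' released implementation; the helper name is hypothetical, and the actual model may exempt common function words from the mask.

\begin{verbatim}
import torch

def apply_repetition_mask(logits: torch.Tensor,
                          generated_ids: list) -> torch.Tensor:
    """Hypothetical helper: forbid re-predicting emitted tokens.

    logits: (vocab_size,) scores over the combined vocabulary
            (common words + extracted OCR tokens) at one decode step.
    generated_ids: indices of tokens emitted so far in this caption.
    """
    masked = logits.clone()
    masked[generated_ids] = float("-inf")  # repeats become impossible
    return masked

# Usage at each decoding step (decoder call is schematic):
# step_logits = decoder(...)                       # (vocab_size,)
# step_logits = apply_repetition_mask(step_logits, caption_so_far)
# next_id = int(step_logits.argmax())
\end{verbatim}

Setting masked scores to $-\infty$ (rather than merely down-weighting them) guarantees zero probability after the softmax, which matches the stated goal of avoiding repeated words outright.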