Text-based image captioning is an important but under-explored task that aims to generate descriptions containing both visual objects and scene text. Recent studies have made encouraging progress, but they still suffer from a lack of overall scene understanding and generate inaccurate captions. One possible reason is that current studies mainly focus on constructing plane-level geometric relationships among scene text, without depth information. This leads to insufficient scene-text relational reasoning, so models may describe scene text inaccurately. Another possible reason is that existing methods fail to generate fine-grained descriptions of some visual objects. Moreover, they may ignore essential visual objects, so the scene text belonging to those ignored objects goes unused. To address the above issues, we propose a DEpth and VIsual ConcEpts Aware Transformer (DEVICE) for TextCaps. Concretely, to construct three-dimensional geometric relations, we introduce depth information and propose a depth-enhanced feature updating module to ameliorate OCR token features. To generate more precise and comprehensive captions, we introduce the semantic features of detected visual object concepts as auxiliary information. Our DEVICE is capable of understanding scenes more comprehensively and boosting the accuracy of the described visual entities. Extensive experiments demonstrate the effectiveness of our proposed DEVICE, which outperforms state-of-the-art models on the TextCaps test set. Our code will be publicly available.