Image captioning is a challenging task that requires both understanding the visual information in an image and describing it in natural language. In this paper, we propose an efficient way to improve the image-understanding ability of a transformer-based method by extending the Object Relation Transformer architecture with the Attention on Attention mechanism. Experiments on the VieCap4H dataset show that our proposed method significantly outperforms the original architecture on both the public and private test sets of the Image Captioning shared task held by VLSP.
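For context, a minimal sketch of the Attention on Attention computation as it is commonly formulated: the attention result is gated by a query-conditioned sigmoid gate before being passed on. The weight matrices $W^{i}_{q}, W^{i}_{v}, W^{g}_{q}, W^{g}_{v}$ and biases $b^{i}, b^{g}$ below are illustrative notation, not symbols taken from this paper.

\[
\hat{V} = f_{\mathrm{att}}(Q, K, V), \qquad
\mathrm{AoA}(Q, K, V) = \sigma\!\left(W^{g}_{q} Q + W^{g}_{v} \hat{V} + b^{g}\right) \odot \left(W^{i}_{q} Q + W^{i}_{v} \hat{V} + b^{i}\right),
\]

where $f_{\mathrm{att}}$ is a standard (e.g., scaled dot-product) attention module, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.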