Current state-of-the-art methods for image captioning employ region-based features, as they provide object-level information that is essential for describing the content of images; these features are usually extracted by an object detector such as Faster R-CNN. However, region features have several issues, such as a lack of contextual information, the risk of inaccurate detection, and high computational cost. The first two can be mitigated by additionally using grid-based features, yet how to extract and fuse these two types of features remains largely unexplored. This paper proposes a Transformer-only neural architecture, dubbed GRIT (Grid- and Region-based Image captioning Transformer), that effectively utilizes the two visual features to generate better captions. GRIT replaces the CNN-based detector employed in previous methods with a DETR-based one, making it computationally faster. Moreover, its monolithic design, consisting only of Transformers, enables end-to-end training of the model. This innovative design and the integration of the dual visual features bring about significant performance improvements. Experimental results on several image captioning benchmarks show that GRIT outperforms previous methods in both inference accuracy and speed.
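To make the idea of consuming both feature types concrete, below is a minimal PyTorch sketch of one plausible way a caption decoder layer could attend to grid features and region features through two separate cross-attention streams. The class name, layer sizes, sequential grid-then-region attention order, and toy shapes are illustrative assumptions for exposition, not the exact architecture specified in the paper.

```python
import torch
import torch.nn as nn


class DualFeatureCaptionDecoderLayer(nn.Module):
    """Illustrative decoder layer attending to grid and region features.

    Hypothetical sketch: names and the grid-then-region attention order
    are assumptions, not the authors' exact design.
    """

    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.grid_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.region_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, words, grid_feats, region_feats, causal_mask=None):
        # Masked self-attention over the partially generated caption tokens.
        x = words + self.self_attn(words, words, words, attn_mask=causal_mask)[0]
        x = self.norms[0](x)
        # Cross-attention to grid features (global, contextual information).
        x = x + self.grid_attn(x, grid_feats, grid_feats)[0]
        x = self.norms[1](x)
        # Cross-attention to region features (object-level evidence from a DETR-style detector).
        x = x + self.region_attn(x, region_feats, region_feats)[0]
        x = self.norms[2](x)
        # Position-wise feed-forward network.
        x = x + self.ffn(x)
        return self.norms[3](x)


# Toy usage: 20 caption tokens, 7x7=49 grid cells, 100 region queries.
layer = DualFeatureCaptionDecoderLayer()
words = torch.randn(2, 20, 512)
grid = torch.randn(2, 49, 512)
regions = torch.randn(2, 100, 512)
print(layer(words, grid, regions).shape)  # torch.Size([2, 20, 512])
```

Keeping the two cross-attention streams separate lets the decoder weigh contextual (grid) and object-level (region) evidence independently; alternatives such as concatenating the two feature sets into a single key/value memory are equally possible under this sketch's assumptions.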