This paper presents GRiT, a Generative RegIon-to-Text transformer for object understanding. The spirit of GRiT is to formulate object understanding as <region, text> pairs, where the region localizes an object and the text describes it. For example, the text in object detection denotes class names, while in dense captioning it refers to descriptive sentences. Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions. With the same model architecture, GRiT can understand objects not only via simple nouns but also via rich descriptive sentences including object attributes or actions. Experimentally, we apply GRiT to object detection and dense captioning tasks. GRiT achieves 60.4 AP on COCO 2017 test-dev for object detection and 15.5 mAP on Visual Genome for dense captioning. Code is available at https://github.com/JialianW/GRiT
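To make the <region, text> formulation concrete, the following is a minimal, hypothetical sketch (not taken from the paper or its repository) of how a unified output could be represented for both tasks; the `RegionText` class and all example values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical illustration of GRiT's unified <region, text> formulation:
# detection and dense captioning share the same output structure,
# differing only in the decoded text (class nouns vs. descriptive sentences).

@dataclass
class RegionText:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image coordinates
    text: str                               # open-set description decoded for this region
    score: float                            # foreground/objectness confidence

# Object detection: the text is a simple class noun.
detection_output: List[RegionText] = [
    RegionText(box=(34.0, 50.0, 210.0, 300.0), text="dog", score=0.92),
]

# Dense captioning: the text is a descriptive sentence with attributes or actions.
dense_caption_output: List[RegionText] = [
    RegionText(box=(34.0, 50.0, 210.0, 300.0),
               text="a brown dog running on the grass", score=0.88),
]
```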