In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoders/decoders) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost performance. Without bells and whistles, our GIT establishes new state-of-the-art results on 12 challenging benchmarks by a large margin. For instance, our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. Code is released at \url{https://github.com/microsoft/GenerativeImage2Text}.
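To make the described design concrete, below is a minimal PyTorch sketch of the "one image encoder + one text decoder under a single language-modeling loss" idea. All module names, dimensions, and masking details here are illustrative assumptions rather than the released implementation; see the repository linked above for the actual code.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class GITSketch(nn.Module):
    # Hypothetical sketch: one image encoder + one Transformer text decoder,
    # trained with a single causal language-modeling loss over the caption.
    # (Positional embeddings and padding handling are omitted for brevity.)
    def __init__(self, image_encoder, vocab_size, d_model=768, n_heads=12, n_layers=6):
        super().__init__()
        self.image_encoder = image_encoder  # e.g. a ViT returning (B, N_img, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, text_ids):
        img = self.image_encoder(images)      # (B, N_img, D) visual tokens
        txt = self.text_embed(text_ids)       # (B, N_txt, D) text tokens
        x = torch.cat([img, txt], dim=1)      # visual tokens prefix the text
        n_img, n_txt = img.size(1), txt.size(1)
        n = n_img + n_txt
        # seq2seq-style mask: image tokens attend only to image tokens,
        # text tokens attend to all image tokens and to earlier text tokens.
        mask = torch.zeros(n, n, dtype=torch.bool, device=x.device)
        mask[:n_img, n_img:] = True           # block image -> text attention
        mask[n_img:, n_img:] = torch.triu(
            torch.ones(n_txt, n_txt, dtype=torch.bool, device=x.device), diagonal=1)
        out = self.decoder(x, mask=mask)
        logits = self.lm_head(out[:, n_img:])  # predictions at text positions
        # Next-token prediction: the state at text position i predicts token i+1.
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            text_ids[:, 1:].reshape(-1))
\end{verbatim}

Under this single-loss formulation, pre-training and fine-tuning share the same objective: captioning, question answering (with the question as a text prefix), and the generation-based classification and scene text recognition mentioned above are all cast as generating the target text.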