In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically relies on complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost performance. Without bells and whistles, our GIT establishes new state-of-the-art results on 12 challenging benchmarks by a large margin. For instance, our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new generation-based scheme for image classification and scene text recognition, achieving decent performance on standard benchmarks.
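To make the "one image encoder, one text decoder, one language-modeling task" design concrete, the following is a minimal PyTorch sketch. It is illustrative only: the class name `GITSketch`, the feature dimension of 1024, the vocabulary size, and the layer counts are assumptions, and the text decoder here cross-attends to projected image features rather than reproducing the paper's exact attention scheme over concatenated image and text tokens.

```python
import torch
import torch.nn as nn

class GITSketch(nn.Module):
    """Hypothetical sketch: one image encoder (stand-in) feeding one text
    decoder, trained with a plain next-token language-modeling loss.
    Sizes and internals are illustrative, not the paper's configuration."""

    def __init__(self, vocab_size=30522, d_model=768, n_layers=6, n_heads=12):
        super().__init__()
        # Image encoder stand-in: the paper uses a pretrained vision
        # transformer; here a linear projection of pre-extracted patch
        # features keeps the example self-contained.
        self.image_proj = nn.Linear(1024, d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, input_ids):
        # patch_feats: (B, num_patches, 1024) image features
        # input_ids:   (B, seq_len) caption tokens shifted right
        memory = self.image_proj(patch_feats)
        tgt = self.token_emb(input_ids)
        # Causal mask: each text position attends only to earlier tokens.
        seq_len = input_ids.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)  # next-token logits

# Training reduces to standard cross-entropy over the next token:
model = GITSketch()
patch_feats = torch.randn(2, 196, 1024)
input_ids = torch.randint(0, 30522, (2, 20))
labels = torch.randint(0, 30522, (2, 20))
logits = model(patch_feats, input_ids)
loss = nn.CrossEntropyLoss()(
    logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```

Under this view, captioning, VQA (question as a text prefix), and even classification or scene text recognition (labels generated as text) all reduce to the same decoding objective, which is the unification the abstract refers to.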