Current vision-language pretraining models are dominated by methods that use region visual features extracted from object detectors. Despite their good performance, the extract-then-process pipeline significantly restricts inference speed and therefore limits their real-world use cases. However, training vision-language models from raw image pixels is difficult, as raw pixels provide much less prior knowledge than region features. In this paper, we systematically study how to leverage auxiliary visual pretraining tasks to help train end-to-end vision-language models. We introduce three types of visual losses that enable much faster convergence and better finetuning accuracy. Compared with region-feature models, our end-to-end models achieve similar or better performance on downstream tasks and run more than 10 times faster during inference. Compared with other end-to-end models, our proposed method achieves similar or better performance with only 10% of the pretraining GPU hours.
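To make the contrast concrete, the following is a minimal sketch of an end-to-end pipeline of the kind described above: raw pixels are encoded directly into grid features, fused with text in a joint transformer, and supervised by an auxiliary visual head. All module names, sizes, and the auxiliary head are illustrative assumptions, not the paper's actual architecture or losses.

```python
# Minimal sketch (assumed structure, not the paper's implementation) of an
# end-to-end vision-language model that consumes raw pixels, so no separate
# object detector is run at inference time.
import torch
import torch.nn as nn

class EndToEndVLM(nn.Module):
    """Raw pixels -> grid features -> joint multimodal transformer."""
    def __init__(self, d_model=256, vocab=30522):
        super().__init__()
        # Simple convolutional backbone standing in for a visual encoder.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),  # 16x16 "patches"
            nn.ReLU(),
        )
        self.txt_embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Hypothetical auxiliary visual head; stands in for the paper's
        # auxiliary visual pretraining losses, which are not reproduced here.
        self.visual_head = nn.Linear(d_model, d_model)

    def forward(self, pixels, token_ids):
        grid = self.backbone(pixels)                        # (B, C, H', W')
        vis = grid.flatten(2).transpose(1, 2)               # (B, N_vis, C)
        txt = self.txt_embed(token_ids)                     # (B, N_txt, C)
        fused = self.fusion(torch.cat([vis, txt], dim=1))   # joint encoding
        aux_pred = self.visual_head(fused[:, : vis.size(1)])
        return fused, aux_pred

# Usage: a single forward pass over image pixels and text tokens.
model = EndToEndVLM()
img = torch.randn(2, 3, 224, 224)
ids = torch.randint(0, 30522, (2, 16))
fused, aux = model(img, ids)
print(fused.shape, aux.shape)
```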