Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.
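The zero-shot text-conditioned detection mentioned above boils down to scoring each detected object's embedding against free-text query embeddings from the shared image-text space. The following is a minimal illustrative sketch of that matching step only, not the authors' implementation: the function names and toy embeddings are hypothetical, and real embeddings would come from the pre-trained image and text encoders.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_boxes(box_embeds, query_embeds, labels, threshold=0.5):
    """Assign each predicted box the best-matching text query (hypothetical helper).

    In an open-vocabulary detector, `query_embeds` can encode arbitrary
    class names at inference time, so the label set is not fixed at training.
    """
    results = []
    for be in box_embeds:
        scores = [cosine(be, qe) for qe in query_embeds]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            results.append((labels[best], scores[best]))
        else:
            results.append((None, scores[best]))  # below threshold: no match
    return results

# Toy 2-D embeddings; real ones are high-dimensional encoder outputs.
queries = [[1.0, 0.0], [0.0, 1.0]]
labels = ["cat", "dog"]
boxes = [[0.9, 0.1], [0.1, 0.95]]
print(classify_boxes(boxes, queries, labels))
```

Because the text queries are embedded at inference time, swapping in a new vocabulary requires no retraining; one-shot image-conditioned detection follows the same scheme with an image-derived query embedding in place of the text one.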