The de-facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet. Recent methods have explored unsupervised pretraining to scale to vast quantities of unlabeled images. In contrast, we aim to learn high-quality visual representations from fewer images. To this end, we revisit supervised pretraining, and seek data-efficient alternatives to classification-based pretraining. We propose VirTex -- a pretraining approach using semantically dense captions to learn visual representations. We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks including image classification, object detection, and instance segmentation. On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised -- despite using up to ten times fewer images.
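To make the captioning-based pretraining recipe concrete, the sketch below shows the general shape of such a model: a convolutional backbone trained from scratch, coupled to a small text decoder that predicts caption tokens, with only the backbone kept for downstream transfer. This is a minimal illustration, not the authors' implementation; the single-layer unidirectional decoder, vocabulary size, feature pooling, and all hyperparameters are assumptions made for brevity.

```python
# Minimal sketch of caption-based visual pretraining (illustrative, not VirTex's exact model).
import torch
import torch.nn as nn
import torchvision


class CaptioningPretrainModel(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, num_layers=1):
        super().__init__()
        # Visual backbone trained from scratch (no ImageNet weights).
        self.backbone = torchvision.models.resnet50(weights=None)
        self.backbone.fc = nn.Identity()          # keep 2048-d pooled features
        self.project = nn.Linear(2048, d_model)   # map features to decoder width

        # Textual head: a small Transformer decoder conditioned on image features.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.output = nn.Linear(d_model, vocab_size)

    def forward(self, images, caption_tokens):
        # images: (B, 3, H, W); caption_tokens: (B, T) integer token ids
        memory = self.project(self.backbone(images)).unsqueeze(1)  # (B, 1, d_model)
        tgt = self.embed(caption_tokens)                           # (B, T, d_model)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.output(hidden)                                 # (B, T, vocab_size)


model = CaptioningPretrainModel()
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 10000, (2, 12))

# Next-token prediction loss on the captions drives representation learning.
logits = model(images, tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
)
loss.backward()

# After pretraining, only the visual backbone is transferred to downstream
# recognition tasks (classification, detection, instance segmentation).
visual_backbone = model.backbone
```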