Pre-training models on large-scale datasets, like ImageNet, is a standard practice in computer vision. This paradigm is especially effective for tasks with small training sets, for which high-capacity models tend to overfit. In this work, we consider a self-supervised pre-training scenario that only leverages the target task data. We consider datasets, like Stanford Cars, Sketch or COCO, which are order(s) of magnitude smaller than ImageNet. Our study shows that denoising autoencoders, such as BEiT or a variant that we introduce in this paper, are more robust to the type and size of the pre-training data than popular self-supervised methods trained by comparing image embeddings. We obtain competitive performance compared to ImageNet pre-training on a variety of classification datasets from different domains. On COCO, when pre-training solely using COCO images, the detection and instance segmentation performance surpasses the supervised ImageNet pre-training in a comparable setting.
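To make the denoising-autoencoder pre-training setting concrete, the sketch below shows a minimal masked-image-modeling objective trained only on target-task images, in the spirit of BEiT-style methods. It is an illustrative assumption, not the paper's actual SplitMask implementation: the class name, architecture sizes, masking ratio, and pixel-reconstruction loss are all placeholders chosen for brevity.

```python
# Minimal sketch (assumed, not the paper's implementation): a masked-patch
# autoencoder pre-trained on target-task images only, then fine-tuned downstream.
import torch
import torch.nn as nn

class MaskedPatchAutoencoder(nn.Module):
    """Encode patches with some replaced by a mask token; reconstruct masked pixels."""
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=4):
        super().__init__()
        self.patch = patch
        self.num_patches = (img_size // patch) ** 2
        self.embed = nn.Linear(3 * patch * patch, dim)           # patchify + project
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))   # placeholder for masked patches
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 3 * patch * patch)            # predict raw pixels

    def patchify(self, x):
        b, c, h, w = x.shape
        p = self.patch
        x = x.unfold(2, p, p).unfold(3, p, p)                    # (b, c, h/p, w/p, p, p)
        return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)

    def forward(self, imgs, mask_ratio=0.5):
        patches = self.patchify(imgs)                            # (b, n, 3*p*p)
        tokens = self.embed(patches) + self.pos
        mask = torch.rand(tokens.shape[:2], device=imgs.device) < mask_ratio
        tokens = torch.where(mask[..., None], self.mask_token, tokens)
        pred = self.head(self.encoder(tokens))
        loss = ((pred - patches) ** 2)[mask].mean()              # loss only on masked patches
        return loss

# Usage: iterate over the target-task images themselves (e.g. a COCO image loader),
# pre-train with this objective, then fine-tune the encoder on the downstream task.
model = MaskedPatchAutoencoder()
opt = torch.optim.AdamW(model.parameters(), lr=1.5e-4)
imgs = torch.randn(8, 3, 224, 224)                               # stand-in for a batch of target images
loss = model(imgs)
loss.backward()
opt.step()
```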