The computer vision world has seen renewed enthusiasm for various pre-trained models, including both classical ImageNet supervised pre-training and the recently emerged self-supervised pre-training such as SimCLR and MoCo. Pre-trained weights often boost a wide range of downstream tasks, including classification, detection, and segmentation. The latest studies suggest that pre-training benefits from gigantic model capacity. We are hereby curious and ask: after pre-training, does a model indeed have to stay large to preserve its universal downstream transferability? In this paper, we examine supervised and self-supervised pre-trained models through the lens of the lottery ticket hypothesis (LTH). LTH identifies highly sparse matching subnetworks that can be trained in isolation from (nearly) scratch yet reach the full models' performance. We extend the scope of LTH and question whether matching subnetworks still exist in pre-trained models that enjoy the same downstream transfer performance. Our extensive experiments convey an overall positive message: from all pre-trained weights obtained by ImageNet classification, SimCLR, and MoCo, we are consistently able to locate such matching subnetworks at 59.04% to 96.48% sparsity that transfer universally to multiple downstream tasks, whose performance sees no degradation compared to using the full pre-trained weights. Further analyses reveal that subnetworks found from different pre-training tend to yield diverse mask structures and perturbation sensitivities. We conclude that the core LTH observations remain generally relevant in the pre-training paradigm of computer vision, but more delicate discussions are needed in some cases. Code and pre-trained models will be made available at: https://github.com/VITA-Group/CV_LTH_Pre-training.
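To make the LTH procedure concrete, below is a minimal sketch of iterative magnitude pruning (IMP) applied to a pre-trained backbone: prune a fraction of the smallest-magnitude weights, rewind the surviving weights to the pre-trained values, and repeat. The ResNet-50 backbone, the 20%-per-round pruning rate, the number of rounds, and the `train_downstream` placeholder are illustrative assumptions for this sketch, not the paper's exact recipe (see the repository above for the authors' implementation).

```python
# Minimal IMP sketch on ImageNet-pretrained weights (assumptions noted above).
import copy
import torch
import torchvision
import torch.nn.utils.prune as prune

# Load a pre-trained backbone (newer torchvision weights API).
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
rewind_state = copy.deepcopy(model.state_dict())  # pre-trained weights to rewind to

# Prune all convolutional weights globally by L1 magnitude.
conv_params = [(m, "weight") for m in model.modules()
               if isinstance(m, torch.nn.Conv2d)]

for round_idx in range(10):  # ~89% sparsity after 10 rounds of 20% pruning
    # train_downstream(model)  # hypothetical fine-tuning step on a downstream task

    # Remove 20% of the *remaining* weights with the smallest magnitudes.
    prune.global_unstructured(conv_params,
                              pruning_method=prune.L1Unstructured,
                              amount=0.2)

    # Rewind surviving weights to their pre-trained values; the binary masks
    # (stored as `weight_mask` buffers) are kept intact.
    for name, param in model.named_parameters():
        key = name.replace("_orig", "")  # pruned params are renamed "*.weight_orig"
        if key in rewind_state:
            param.data.copy_(rewind_state[key])

# The resulting sparse subnetwork (mask + pre-trained weights) is then
# fine-tuned on each downstream task to test whether it "matches" the
# performance of the full pre-trained model.
```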