Large-scale pre-training has proven crucial for various computer vision tasks. However, as the amount of pre-training data, the number of model architectures, and the prevalence of private or inaccessible data grow, it is neither efficient nor always feasible to pre-train every model architecture on large-scale datasets. In this work, we investigate an alternative strategy for pre-training, namely Knowledge Distillation as Efficient Pre-training (KDEP), which aims to efficiently transfer the learned feature representations of existing pre-trained models to new student models for future downstream tasks. We observe that existing Knowledge Distillation (KD) methods are unsuitable for pre-training, since they typically distill the logits, which are discarded when the model is transferred to downstream tasks. To resolve this problem, we propose a feature-based KD method with non-parametric feature dimension alignment. Notably, our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time. Code is available at https://github.com/CVMI-Lab/KDEP.
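For intuition, below is a minimal PyTorch sketch of the general idea of feature-based distillation with a non-parametric dimension alignment, not the authors' implementation: teacher feature maps are reduced to the student's channel dimension via an SVD projection (no learnable adapter), and the student is trained to match the aligned features. The per-batch SVD and the MSE loss on L2-normalized features are simplifying assumptions made here; the exact KDEP alignment and objective may differ, so see the official repository for details.

```python
# Minimal sketch of feature-based KD with non-parametric dimension alignment.
# Assumptions (not from the paper): per-batch SVD projection and an MSE loss on
# L2-normalized features; the actual KDEP alignment and objective may differ.
import torch
import torch.nn.functional as F


def svd_align(t_feat: torch.Tensor, c_s: int) -> torch.Tensor:
    """Reduce teacher feature maps (N, C_t, H, W) to (N, c_s, H, W) by projecting
    the channel dimension onto its top-c_s singular directions (no learned params)."""
    n, c_t, h, w = t_feat.shape
    x = t_feat.permute(0, 2, 3, 1).reshape(-1, c_t)      # (N*H*W, C_t)
    x = x - x.mean(dim=0, keepdim=True)                  # center channels
    _, _, vh = torch.linalg.svd(x, full_matrices=False)  # vh: (k, C_t), k = min(N*H*W, C_t)
    proj = x @ vh[:c_s].T                                # requires c_s <= k
    return proj.reshape(n, h, w, c_s).permute(0, 3, 1, 2)


def feature_kd_loss(s_feat: torch.Tensor, t_feat: torch.Tensor) -> torch.Tensor:
    """Match L2-normalized student features against SVD-aligned teacher features."""
    t_aligned = svd_align(t_feat.detach(), s_feat.shape[1])
    return F.mse_loss(F.normalize(s_feat, dim=1), F.normalize(t_aligned, dim=1))


# Toy usage with ResNet-like backbone features: teacher C_t=2048, student C_s=512.
student_feat = torch.randn(16, 512, 7, 7, requires_grad=True)
teacher_feat = torch.randn(16, 2048, 7, 7)
loss = feature_kd_loss(student_feat, teacher_feat)
loss.backward()
```

Because the alignment is non-parametric, no extra distillation head is trained and then thrown away: the student backbone itself absorbs the teacher's representation, which is what gets reused downstream.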