There is a growing discrepancy in computer vision between large-scale models that achieve state-of-the-art performance and models that are affordable in practical applications. In this paper we address this issue and significantly bridge the gap between these two types of models. Throughout our empirical investigation we do not necessarily aim to propose a new method, but strive to identify a robust and effective recipe for making state-of-the-art large-scale models affordable in practice. We demonstrate that, when performed correctly, knowledge distillation can be a powerful tool for reducing the size of large models without compromising their performance. In particular, we uncover that certain implicit design choices may drastically affect the effectiveness of distillation. Our key contribution is the explicit identification of these design choices, which were not previously articulated in the literature. We back up our findings with a comprehensive empirical study, demonstrate compelling results on a wide range of vision datasets and, in particular, obtain a state-of-the-art ResNet-50 model for ImageNet, which achieves 82.8% top-1 accuracy.
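For readers unfamiliar with the technique, the following is a minimal sketch of the standard knowledge-distillation objective: the student is trained to match the teacher's predicted class distribution by minimizing a KL divergence. It assumes PyTorch; the temperature value and the omission of a hard-label term are illustrative choices for this sketch, not design decisions prescribed by this abstract.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Standard knowledge-distillation loss: KL divergence between the
    teacher's and the student's softened class distributions.

    Note: the temperature and the absence of a weighted cross-entropy term
    on ground-truth labels are assumptions made for this illustration.
    """
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 as is conventional so gradients
    # keep a comparable magnitude across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```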