Ensembles of machine learning models yield improved performance and robustness. However, their memory requirements and inference costs can be prohibitively high. Knowledge distillation is an approach that allows a single model to efficiently capture the approximate performance of an ensemble, but it scales poorly, since re-training is required whenever new teacher models are introduced. In this paper, we study whether a meta-learning strategy can be used to directly predict the parameters of a single model with performance comparable to that of an ensemble. To this end, we introduce WeightFormer, a Transformer-based model that predicts student network weights layer by layer in a single forward pass, conditioned on the teacher model parameters. The properties of WeightFormer are investigated on the CIFAR-10, CIFAR-100, and ImageNet datasets with the VGGNet-11, ResNet-50, and ViT-B/32 architectures, where our method achieves classification performance close to that of an ensemble and outperforms both a single network and standard knowledge distillation. More encouragingly, we show that with minor fine-tuning WeightFormer can even exceed the average ensemble. Importantly, our task, together with the model and results, can potentially lead to a new, more efficient, and scalable paradigm for learning the parameters of ensemble networks.
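To make the mechanism concrete, below is a minimal, illustrative PyTorch sketch of the general idea of predicting one student layer's weights from the corresponding teacher weights with a Transformer. It is not the paper's implementation: the module names, dimensions, the chunking of flattened weights into tokens, and the naive averaging of teacher weights are all our own assumptions for illustration.

```python
# Illustrative sketch only (assumed design, not WeightFormer's actual architecture):
# a Transformer encoder reads flattened teacher weights for one layer, split into
# fixed-size chunks ("tokens"), and emits the student layer's weights in one forward pass.
import torch
import torch.nn as nn


class WeightFormerSketch(nn.Module):
    def __init__(self, chunk_size=256, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        self.chunk_size = chunk_size
        self.in_proj = nn.Linear(chunk_size, d_model)   # embed teacher weight chunks
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, n_layers)
        self.out_proj = nn.Linear(d_model, chunk_size)  # decode student weight chunks

    def forward(self, teacher_weights):
        # teacher_weights: list of 1-D tensors, one flattened weight vector per teacher
        # for the current layer (all teachers share the layer's shape).
        flat = torch.stack(teacher_weights).mean(dim=0)     # naive teacher aggregation (assumption)
        n = flat.numel()
        pad = (-n) % self.chunk_size
        flat = torch.cat([flat, flat.new_zeros(pad)])        # pad to a whole number of chunks
        chunks = flat.view(1, -1, self.chunk_size)           # (1, seq_len, chunk_size)
        hidden = self.encoder(self.in_proj(chunks))
        student_flat = self.out_proj(hidden).flatten()[:n]   # drop padding
        return student_flat                                   # reshape to the layer's shape downstream


# Usage: predict one student layer from three teachers' weights for that layer.
teachers = [torch.randn(64 * 128) for _ in range(3)]
model = WeightFormerSketch()
student_layer = model(teachers)  # 1-D tensor with 64*128 entries
```

In this sketch, predicting a whole student network amounts to repeating the forward pass layer by layer; how teacher weights are aggregated and tokenized is a design choice we assume here purely for illustration.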