Training large neural networks is possible by training a smaller hypernetwork that predicts parameters for the large ones. A recently released Graph HyperNetwork (GHN) trained this way on one million smaller ImageNet architectures is able to predict parameters for large unseen networks such as ResNet-50. While networks with predicted parameters lose performance on the source task, the predicted parameters have been found useful for fine-tuning on other tasks. We study whether fine-tuning based on the same GHN is still useful for novel strong architectures that were published after the GHN had been trained. We find that for recent architectures such as ConvNeXt, GHN initialization becomes less useful than for ResNet-50. One potential reason is the increased distribution shift between novel architectures and those used to train the GHN. We also find that the predicted parameters lack the diversity necessary to fine-tune them successfully with gradient descent. We alleviate this limitation by applying simple post-processing techniques to the predicted parameters before fine-tuning them on a target task, and thereby improve fine-tuning of ResNet-50 and ConvNeXt.
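To make the idea of post-processing predicted parameters concrete, below is a minimal sketch of one plausible such step: injecting small Gaussian noise into GHN-predicted weights so that near-identical predictions become more diverse before gradient-based fine-tuning. The function name, the noise-scale value, and the choice of noise injection itself are illustrative assumptions, not the paper's exact procedure, which may use different or additional techniques.

```python
import torch


@torch.no_grad()
def diversify_predicted_params(model, noise_scale=1e-2):
    # Illustrative post-processing (assumed, not the paper's exact method):
    # perturb each GHN-predicted parameter tensor with Gaussian noise scaled
    # by that tensor's own standard deviation, so the perturbation stays
    # relative to the magnitude of the predicted weights.
    for p in model.parameters():
        std = p.detach().std()
        if torch.isfinite(std) and std > 0:
            p.add_(torch.randn_like(p) * noise_scale * std)
    return model


# Hypothetical usage: `model` is assumed to already carry GHN-predicted
# parameters (e.g., a ResNet-50 or ConvNeXt); after this call it would be
# fine-tuned on the target task as usual.
# model = diversify_predicted_params(model, noise_scale=1e-2)
```

Scaling the noise by the per-tensor standard deviation is one simple way to break ties among repeated predicted weights without overwhelming the information the GHN provides; other post-processing choices are equally compatible with the setup described above.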