Lottery tickets (LTs) are able to discover accurate and sparse subnetworks that can be trained in isolation to match the performance of dense networks. Ensembling, in parallel, is one of the oldest time-proven tricks in machine learning: it improves performance by combining the outputs of multiple independent models. However, the benefits of ensembling are diluted in the context of LTs, since ensembling does not directly lead to stronger sparse subnetworks but instead leverages their predictions for a better decision. In this work, we first observe that directly averaging the weights of adjacent learned subnetworks significantly boosts the performance of LTs. Encouraged by this observation, we further propose an alternative way to perform an 'ensemble' over the subnetworks identified by iterative magnitude pruning via a simple interpolating strategy. We call our method Lottery Pools. In contrast to naive ensembling, which brings no performance gain to any individual subnetwork, Lottery Pools yields much stronger sparse subnetworks than the original LTs without requiring any extra training or inference cost. Across various modern architectures on CIFAR-10/100 and ImageNet, we show that our method achieves significant performance gains in both in-distribution and out-of-distribution scenarios. Impressively, evaluated with VGG-16 and ResNet-18, the produced sparse subnetworks outperform the original LTs by up to 1.88% on CIFAR-100 and 2.36% on CIFAR-100-C; the resulting dense network surpasses the pre-trained dense model by up to 2.22% on CIFAR-100 and 2.38% on CIFAR-100-C.
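To make the weight-averaging idea concrete, the following is a minimal PyTorch sketch of interpolating several trained subnetworks' parameters. It assumes the subnetworks share one architecture (e.g. tickets from adjacent rounds of iterative magnitude pruning, trained from the same rewound initialization); the function and variable names are hypothetical and the optional mask step is an assumption used only to keep the merged model at a target sparsity, not a specification of the paper's exact procedure.

```python
import torch

def interpolate_subnetworks(state_dicts, coeffs, mask=None):
    """Linearly interpolate the parameters of several trained subnetworks.

    state_dicts: list of state_dicts from subnetworks with identical architecture.
    coeffs:      list of interpolation coefficients, assumed to sum to 1.
    mask:        optional dict of {param_name: 0/1 tensor}; if given, it is
                 re-applied so the merged model keeps the target sparsity
                 (an assumption for illustration).
    """
    assert len(state_dicts) == len(coeffs)
    merged = {}
    for name in state_dicts[0]:
        # Weighted sum of the corresponding tensors across subnetworks.
        merged[name] = sum(c * sd[name].float() for sd, c in zip(state_dicts, coeffs))
        if mask is not None and name in mask:
            merged[name] = merged[name] * mask[name]
    return merged

# Usage sketch: equal-weight average of two adjacent tickets.
# merged_sd = interpolate_subnetworks([ticket_a.state_dict(), ticket_b.state_dict()],
#                                     coeffs=[0.5, 0.5])
# model.load_state_dict(merged_sd)
```

Because the merge is a one-off averaging of already-trained weights, it adds no training steps and leaves the inference cost of a single network unchanged.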