The width of a neural network matters, since increasing the width necessarily increases the model capacity. However, the performance of a network does not improve linearly with its width and soon saturates. To tackle this problem, we propose to increase the number of networks rather than purely scaling up the width. To validate this idea, we divide one large network into several small ones, each holding a fraction of the original network's parameters. We then train these small networks together and expose them to different views of the same data so that they learn distinct and complementary knowledge. During this co-training process, the networks can also learn from each other. As a result, the small networks achieve better ensemble performance than the large one with few or no extra parameters or FLOPs. \emph{This reveals that the number of networks is a new dimension of effective model scaling, besides depth/width/resolution}. The small networks can also achieve faster inference than the large one by running concurrently on different devices. We validate this idea -- that increasing the number of networks is a new dimension of effective model scaling -- with different network architectures on common benchmarks through extensive experiments. The code is available at \url{https://github.com/mzhaoshuai/SplitNet-Divide-and-Co-training}.
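To make the divide-and-co-train idea concrete, the following is a minimal sketch, not the authors' SplitNet implementation. It assumes \texttt{torchvision} ResNet-18 models stand in for the small networks, a KL-divergence mutual-learning term stands in for the co-training signal, and each network receives its own augmented view of the same batch; all names and hyperparameters (\texttt{K}, \texttt{alpha}, \texttt{temperature}) are illustrative assumptions.

\begin{verbatim}
# Illustrative sketch of divide-and-co-training (assumptions noted above);
# not the authors' exact SplitNet code.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

K = 2  # number of small networks replacing one large network
nets = [resnet18(num_classes=10) for _ in range(K)]
params = [p for net in nets for p in net.parameters()]
optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9, weight_decay=5e-4)

def co_training_step(views, target, alpha=1.0, temperature=3.0):
    """One step: each small network sees its own augmented view of the batch.

    views:  list of K tensors (B, 3, H, W), different augmentations of the same images
    target: (B,) ground-truth labels
    alpha:  weight of the mutual-learning (co-training) term
    """
    logits = [net(v) for net, v in zip(nets, views)]

    # Supervised loss for every small network.
    loss = sum(F.cross_entropy(z, target) for z in logits)

    # Mutual learning: each network matches the (detached) softened average
    # of its peers, so the networks exchange complementary knowledge.
    for i, z_i in enumerate(logits):
        peers = torch.stack([z for j, z in enumerate(logits) if j != i]).mean(0)
        loss = loss + alpha * F.kl_div(
            F.log_softmax(z_i / temperature, dim=1),
            F.softmax(peers.detach() / temperature, dim=1),
            reduction="batchmean",
        ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def ensemble_predict(x):
    """Average the softmax outputs of the small networks
    (each network can run concurrently on a different device)."""
    return torch.stack([F.softmax(net(x), dim=1) for net in nets]).mean(0)
\end{verbatim}

At inference time, \texttt{ensemble\_predict} averages the small networks' predictions, which is where the ensemble gain over the single large network comes from in this sketch.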