In this work we analyze strategies for convolutional neural network scaling; that is, the process of scaling a base convolutional network to endow it with greater computational complexity and, consequently, greater representational power. Example scaling strategies include increasing model width, depth, resolution, etc. While various scaling strategies exist, their tradeoffs are not fully understood. Existing analysis typically focuses on the interplay of accuracy and flops (floating-point operations). Yet, as we demonstrate, various scaling strategies affect model parameters, activations, and consequently actual runtime quite differently. In our experiments we show the surprising result that numerous scaling strategies yield networks with similar accuracy but widely varying properties. This leads us to propose a simple fast compound scaling strategy that encourages primarily scaling model width, while scaling depth and resolution to a lesser extent. Unlike currently popular scaling strategies, which result in roughly an $O(s)$ increase in model activations when scaling flops by a factor of $s$, the proposed fast compound scaling results in close to an $O(\sqrt{s})$ increase in activations, while achieving excellent accuracy. This leads to comparable speedups on modern memory-limited hardware (e.g., GPU, TPU). More generally, we hope this work provides a framework for analyzing and selecting scaling strategies under various computational constraints.
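To make the stated asymptotics concrete, the following is a minimal sketch, assuming a single convolutional stage with depth $d$ (number of layers), width $w$ (channels), kernel size $k$, and spatial resolution $r \times r$; these symbols are introduced here purely for illustration and do not reflect the full analysis. Up to constant factors, flops $f$, parameters $p$, and activations $a$ behave as
$$f \propto d\,k^{2}w^{2}r^{2}, \qquad p \propto d\,k^{2}w^{2}, \qquad a \propto d\,w\,r^{2}.$$
Scaling flops by a factor of $s$ via depth ($d \to s\,d$) or resolution ($r \to \sqrt{s}\,r$) therefore increases activations by $O(s)$, whereas scaling via width ($w \to \sqrt{s}\,w$) increases activations by only $O(\sqrt{s})$. This gap is what motivates a compound rule that places most of the scaling on width: it keeps activations, and hence runtime on memory-limited hardware, low.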