The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.
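For reference, the dense-model scaling laws alluded to in the first sentence (e.g. Kaplan et al., 2020) take the single-variable power-law form shown on the left below; a two-variable generalization of the kind studied here would add dependence on a routing-related quantity such as the expert count. The bilinear form on the right is an illustrative sketch only, not necessarily the exact functional form derived in the paper.

\[
  L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N},
  \qquad
  \log L(N, E) \approx a \log N + b \log E + c \,\log N \log E + d,
\]

where $N$ is the parameter count, $E$ is the number of experts (a proxy for routing capacity at fixed compute), and $N_c$, $\alpha_N$, $a$, $b$, $c$, $d$ are fitted constants.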