The recent observation of neural power-law scaling relations has made a significant impact in the field of deep learning. As a consequence, a substantial amount of attention has been dedicated to the description of scaling laws, although mostly for supervised learning and only to a limited extent for reinforcement learning frameworks. In this paper we present an extensive study of performance scaling for a cornerstone reinforcement learning algorithm, AlphaZero. On the basis of a relationship between Elo rating, playing strength and power-law scaling, we train AlphaZero agents on the games Connect Four and Pentago and analyze their performance. We find that player strength scales as a power law in neural network parameter count when not bottlenecked by available compute, and as a power of compute when training optimally sized agents. We observe nearly identical scaling exponents for both games. Combining the two observed scaling laws, we obtain a power law relating optimal neural network size to compute, similar to the ones observed for language models. We find that the predicted scaling of optimal neural network size fits our data for both games. This scaling law implies that previously published state-of-the-art game-playing models are significantly smaller than their optimal size, given their respective compute budgets. We also show that large AlphaZero models are more sample efficient, performing better than smaller models with the same amount of training data.
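As a minimal sketch of how the two observed power laws combine (the symbols below are illustrative, not the paper's notation, and the argument assumes the size-scaling law still holds at the compute-optimal point): if playing strength $\gamma$ satisfies $\gamma \propto N^{\alpha_N}$ in the parameter count $N$ when compute is not limiting, and the compute-optimal strength satisfies $\gamma_{\text{opt}} \propto C^{\alpha_C}$ in the compute budget $C$, then equating the two at the optimum gives
$$N_{\text{opt}}(C) \propto C^{\alpha_C/\alpha_N},$$
i.e. a power law relating optimal network size to compute budget of the same form as the relations reported for language models.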