Scaling laws for single-agent reinforcement learning

Recent work has shown that, in generative modeling, cross-entropy loss improves smoothly with model size and training compute, following a power-law-plus-constant scaling law. One challenge in extending these results to reinforcement learning is that the main performance objective of interest, mean episode return, need not vary smoothly. To overcome this, we introduce *intrinsic performance*, a monotonic function of the return defined as the minimum compute required to achieve the given return across a family of models of different sizes. We find that, across a range of environments, intrinsic performance scales as a power law in model size and environment interactions. Consequently, as in generative modeling, the optimal model size scales as a power law in the training compute budget. Furthermore, we study how this relationship varies with the environment and with other properties of the training setup. In particular, using a toy MNIST-based environment, we show that varying the "horizon length" of the task mostly changes the coefficient of this relationship but not its exponent.
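
To make the definition of intrinsic performance concrete, the following is a minimal sketch that reads it directly off a set of learning curves. It is an illustration under stated assumptions, not the procedure used in the paper: the function names `compute_to_reach` and `intrinsic_performance`, the curve format (lists of `(compute, mean_return)` pairs), and all numbers are hypothetical, and real learning curves would likely need smoothing before a threshold crossing is meaningful.

```python
import math


def compute_to_reach(curve, target):
    """Smallest compute at which one model's learning curve first reaches `target`.

    `curve` is a list of (compute, mean_return) pairs sorted by compute.
    Returns math.inf if this model never reaches the target return.
    """
    for compute, mean_return in curve:
        if mean_return >= target:
            return compute
    return math.inf


def intrinsic_performance(curves, target):
    """Minimum compute needed to reach `target` across a family of model sizes."""
    return min(compute_to_reach(curve, target) for curve in curves.values())


# Hypothetical learning curves (made-up numbers), keyed by a model-size label.
curves = {
    "small": [(1, 0.2), (2, 0.5), (4, 0.6), (8, 0.62)],
    "large": [(2, 0.1), (4, 0.55), (8, 0.8), (16, 0.9)],
}

for target in (0.5, 0.6, 0.8):
    print(target, intrinsic_performance(curves, target))
```

In this toy data the small model is the cheapest way to reach a return of 0.5 or 0.6, while only the large model reaches 0.8, so the minimum switches between model sizes as the target rises. Because a higher target can never be reached with less compute, the resulting quantity is monotonic in the return, which is the property the abstract relies on.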
