Large neural network models have become a mainstay of natural language processing and computer vision, yet their initialization and learning rates are set in a largely heuristic fashion, potentially varying from paper to paper and one model size to the next. The $\mu$-Parameterization ($\mu$P) offers a potential solution to these challenges, yielding scaling rules for model initialization and learning rates, and reportedly enabling zero-shot hyperparameter transfer from small to large models in a variety of cases. Despite the evident promise, the $\mu$P scaling rules are not yet widely adopted, perhaps due to higher implementation complexity, many variations, or complex theoretical background. This work investigates $\mu$P empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield optimal learning rates in practice? From models with 2M to 10B parameters, we show that $\mu$-Transfer works as intended for the majority of important cases, but also identify some surprising cases where it may not.
翻译:暂无翻译