Language model (LM) distillation is a trending area that aims to distil the knowledge residing in a large teacher LM to a small student one. While various methods have been proposed to maximize the effectiveness of the distillation, significant challenges persist, particularly when there is a substantial capacity gap between the teacher and student LMs. This issue, often referred to as the \textit{curse} of capacity gap, suggests that a larger teacher does not necessarily result in a superior student compared to one distilled from a smaller teacher. In other words, there is likely an optimal teacher yielding the best student along the scaling course of the teacher. Even worse, the curse of capacity gap can not be lifted without additional compute, as indicated in previous studies. In the context of large LMs (LLMs), previously viable approaches become much less meaningful, as it is impossible to distill a large teacher to a good student without notably additional compute. However, the tale is not ever one-sided. It is always not late to acquire that using a large teacher is resource-demanding. Consequently, instead of sticking to lifting the curse, leaving the curse as is and using a small yet adequate teacher should be arguably fine. Even better, in this paper, we take the spirits of scaling law and reveal that the optimal teacher scale is almost consistently and linearly correlated to the student scale across different model architectures and data scales, fortunately turning the curse into a \textit{law} of capacity gap. The law later guides us to distil a 3B student LM (termed \textsc{MiniMA}) from LLaMA2-7B. \textsc{MiniMA} is demonstrated to outperform a wide range of 3B competitors and could even compete with several 7B models.
翻译:暂无翻译