Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with particular emphasis on the emergence of mid-training as a vital stage bridging pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing targeted capabilities such as mathematics, coding, reasoning, and long-context extension while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks spanning data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations through the lens of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.