We present Mify-Coder, a 2.5B-parameter code model built on the Mify-2.5B foundation model and trained on 4.2T tokens with a compute-optimal strategy. Mify-Coder maintains accuracy and safety comparable to much larger baseline models while significantly outperforming them on standard coding and function-calling benchmarks, demonstrating that compact models can match frontier-grade models in code generation and agent-driven workflows. Our training pipeline combines high-quality curated sources with synthetic data generated from agentically designed prompts, refined iteratively against enterprise-grade evaluation datasets. LLM-based quality filtering further increases data density, enabling frugal yet effective training. Through disciplined exploration of continued-pretraining (CPT) and supervised fine-tuning (SFT) objectives, data mixtures, and sampling dynamics, we deliver frontier-grade code intelligence within a single continuous training trajectory. Our empirical results show that principled data and compute discipline allow smaller models to achieve competitive accuracy, efficiency, and safety compliance. Quantized variants of Mify-Coder can be deployed in standard desktop environments without specialized hardware.