This work presents the first large-scale investigation into constructing a fully open bilingual large language model (LLM) for a non-English language, specifically Korean, trained predominantly on synthetic data. We introduce KORMo-10B, a 10.8B-parameter model trained from scratch on a Korean-English corpus in which 68.74% of the Korean portion is synthetic. Through systematic experimentation, we demonstrate that synthetic data, when carefully curated with balanced linguistic coverage and diverse instruction styles, does not cause instability or degradation during large-scale pretraining. Furthermore, the model achieves performance comparable to that of contemporary open-weight multilingual baselines across a wide range of reasoning, knowledge, and instruction-following benchmarks. Our experiments reveal two key findings: (1) synthetic data can reliably sustain long-horizon pretraining without model collapse, and (2) bilingual instruction tuning enables near-native reasoning and discourse coherence in Korean. By fully releasing all components including data, code, training recipes, and logs, this work establishes a transparent framework for developing synthetic data-driven fully open models (FOMs) in low-resource settings and sets a reproducible precedent for future multilingual LLM research.