The democratization of AI is currently hindered by the immense computational cost of training Large Language Models (LLMs) for low-resource languages. This paper presents Persian-Phi, a 3.8B-parameter model that challenges the assumption that robust multilingual capability requires massive model sizes or multilingual baselines. We demonstrate how Microsoft Phi-3 Mini, originally a monolingual English model, can be effectively adapted to Persian through a novel, resource-efficient curriculum learning pipeline. Our approach employs a distinctive "warm-up" stage that uses bilingual narratives (Tiny Stories) to align embeddings before intensive training, followed by continual pretraining and instruction tuning via Parameter-Efficient Fine-Tuning (PEFT). Despite its compact size, Persian-Phi achieves competitive results on the Open Persian LLM Leaderboard on Hugging Face. Our findings provide a validated, scalable framework for extending state-of-the-art LLMs to underrepresented languages with minimal hardware resources. The Persian-Phi model is publicly available at https://huggingface.co/amirakhlaghiqqq/PersianPhi.
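To make the PEFT stage concrete, the sketch below shows one plausible way to attach LoRA adapters to Phi-3 Mini with the Hugging Face transformers and peft libraries. It is not the authors' exact recipe: the checkpoint name, LoRA rank, alpha, dropout, and target modules are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch of parameter-efficient adaptation of Phi-3 Mini via LoRA.
# All hyperparameters below are assumptions for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "microsoft/Phi-3-mini-4k-instruct"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical LoRA settings; the paper's actual rank/targets may differ.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],  # attention projections in Phi-3
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction of the 3.8B params train
```

Wrapping the base model this way leaves the original weights frozen and trains only the low-rank adapter matrices, which is what keeps the continual pretraining and instruction-tuning stages feasible on modest hardware.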