Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural language. However, the majority of LLM research and development remains English-centric, leaving low-resource languages, such as those of the Southeast Asian (SEA) region, under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance among LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.