SEA-LION：东南亚语言一体化网络 (SEA-LION: Southeast Asian Languages in One Network)

Raymond Ng,Thanh Ngan Nguyen,Yuli Huang,Ngee Chia Tai,Wai Yi Leong,Wei Qi Leong,Xianbin Yong,Jian Gang Ngui,Yosephine Susanto,Nicholas Cheng,Hamsawardhini Rengarajan,Peerat Limkonchotiwat,Adithya Venkatadri Hulagadri,Kok Wai Teng,Yeo Yeow Tong,Bryan Siow,Wei Yi Teo,Wayne Lau,Choon Meng Tan,Brandon Ong,Zhi Hao Ong,Jann Railey Montalan,Adwin Chan,Sajeban Antonyrex,Ren Lee,Esther Choa,David Ong Tat-Wee,Bing Jie Darius Liu,William Chandra Tjhi,Erik Cambria,Leslie Teo

from arxiv, Accepted at IJCNLP-AACL 2025 (Main Track). We released our model at https://huggingface.co/collections/aisingapore/sea-lionv3-672589a39cdadd6a5b199581

Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.

翻译：近年来，大语言模型凭借其处理和生成自然语言的能力，在人工智能领域占据主导地位。然而，大多数大语言模型的研究与开发仍以英语为中心，导致低资源语言（如东南亚地区的语言）代表性不足。为弥补这一代表性差距，我们推出了Llama-SEA-LION-v3-8B-IT和Gemma-SEA-LION-v3-9B-IT，这是两款专为东南亚语言设计的先进多语言大语言模型。SEA-LION系列模型支持11种东南亚语言，包括英语、中文、印尼语、越南语、马来语、泰语、缅甸语、老挝语、菲律宾语、泰米尔语和高棉语。我们的工作基于大规模多语言持续预训练，并结合了包含多阶段指令微调、对齐和模型融合的综合后训练方案。在多语言基准测试中的评估结果表明，我们的模型在支持东南亚语言的大语言模型中实现了最先进的性能。我们将模型开源，以惠及更广泛的东南亚社区。

相关内容

MoDELS

关注 44

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日