Large Language Models (LLMs), such as ChatGPT and GPT-4, have revolutionized natural language processing research and demonstrated potential steps toward Artificial General Intelligence (AGI). However, the high cost of training and deploying LLMs presents obstacles to transparent and open academic research. To address these issues, this project open-sources the Chinese LLaMA and Alpaca large models, with an emphasis on instruction fine-tuning. We extend the original LLaMA vocabulary with 20K additional Chinese tokens, which improves encoding efficiency for Chinese text and strengthens basic semantic understanding. By further applying secondary pre-training on Chinese data and fine-tuning on Chinese instruction data, we substantially improve the models' ability to comprehend and follow instructions. Our pilot study serves as a foundation for researchers adapting LLaMA and Alpaca models to other languages. Resources are made publicly available through GitHub, fostering open research in the Chinese NLP community and beyond. GitHub repository: https://github.com/ymcui/Chinese-LLaMA-Alpaca
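To make the vocabulary-expansion step concrete, below is a minimal sketch of merging pieces from a separately trained Chinese SentencePiece model into LLaMA's tokenizer. This is an illustration under stated assumptions, not the project's released merging script: the file names (`llama/tokenizer.model`, `chinese_sp.model`) and the neutral score assigned to appended pieces are assumptions.

```python
# Minimal sketch: append Chinese SentencePiece pieces to LLaMA's tokenizer model.
# File paths are hypothetical; see the repository for the authors' actual script.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load the original LLaMA tokenizer model (protobuf format).
llama_proto = sp_pb2.ModelProto()
with open("llama/tokenizer.model", "rb") as f:
    llama_proto.ParseFromString(f.read())

# Load a Chinese SentencePiece model trained separately on Chinese text.
chinese_proto = sp_pb2.ModelProto()
with open("chinese_sp.model", "rb") as f:
    chinese_proto.ParseFromString(f.read())

# Append only pieces LLaMA does not already contain.
existing = {p.piece for p in llama_proto.pieces}
for piece in chinese_proto.pieces:
    if piece.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0  # neutral score for appended pieces (assumption)
        llama_proto.pieces.append(new_piece)

# Write out the merged tokenizer model.
with open("chinese_llama_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
```

With a larger Chinese vocabulary, common Chinese words tokenize into fewer pieces, which is what "increasing encoding efficiency" refers to: shorter token sequences for the same text.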
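The secondary pre-training and instruction fine-tuning steps could then proceed on the extended model. The hedged sketch below uses parameter-efficient LoRA adapters via the Hugging Face `peft` library; the model paths, hyperparameters, and the choice of LoRA itself are illustrative assumptions here, not a restatement of the paper's exact training recipe.

```python
# Hypothetical sketch of parameter-efficient fine-tuning after vocabulary expansion.
# Paths and hyperparameters are assumptions, not the project's released config.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model

model = LlamaForCausalLM.from_pretrained(
    "path/to/llama", torch_dtype=torch.float16
)
tokenizer = LlamaTokenizer.from_pretrained("path/to/chinese_llama_tokenizer")

# After extending the vocabulary, the embedding matrix must be resized to match
# the new tokenizer before any further training.
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension (assumption)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are updated
```

Training the adapters first on plain Chinese text (secondary pre-training) and then on Chinese instruction-response pairs mirrors the two-stage recipe summarized in the abstract, while keeping the trainable parameter count far below that of full fine-tuning.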