Large Language Models (LLMs), such as ChatGPT and GPT-4, have revolutionized natural language processing research and demonstrated potential steps toward Artificial General Intelligence (AGI). However, the high cost of training and deploying LLMs presents obstacles to transparent and open academic research. To address these issues, this project open-sources the Chinese LLaMA and Alpaca large models, with an emphasis on instruction fine-tuning. We extend the original LLaMA vocabulary with 20K additional Chinese tokens, which improves encoding efficiency for Chinese text and strengthens basic semantic understanding. By further applying secondary pre-training on Chinese data and fine-tuning on Chinese instruction data, we substantially improve the models' ability to comprehend and follow instructions. Our pilot study serves as a foundation for researchers adapting LLaMA and Alpaca models to other languages. Resources are made publicly available through GitHub, fostering open research in the Chinese NLP community and beyond. GitHub repository: https://github.com/ymcui/Chinese-LLaMA-Alpaca
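To make the vocabulary-expansion step concrete, below is a minimal sketch of merging pieces from a separately trained Chinese SentencePiece model into LLaMA's tokenizer. This is an illustration under stated assumptions, not the project's released merging script: the file names (`llama/tokenizer.model`, `chinese_sp.model`) and the neutral score assigned to appended pieces are assumptions.

```python
# Minimal sketch: append Chinese SentencePiece pieces to LLaMA's tokenizer model.
# File paths are hypothetical; see the repository for the authors' actual script.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load the original LLaMA tokenizer model (protobuf format).
llama_proto = sp_pb2.ModelProto()
with open("llama/tokenizer.model", "rb") as f:
    llama_proto.ParseFromString(f.read())

# Load a Chinese SentencePiece model trained separately on Chinese text.
chinese_proto = sp_pb2.ModelProto()
with open("chinese_sp.model", "rb") as f:
    chinese_proto.ParseFromString(f.read())

# Append only pieces LLaMA does not already contain.
existing = {p.piece for p in llama_proto.pieces}
for piece in chinese_proto.pieces:
    if piece.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0  # neutral score for appended pieces (assumption)
        llama_proto.pieces.append(new_piece)

# Write out the merged tokenizer model.
with open("chinese_llama_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
```

With a larger Chinese vocabulary, common Chinese words tokenize into fewer pieces, which is what "increasing encoding efficiency" refers to: shorter token sequences for the same text.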
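The secondary pre-training and instruction fine-tuning steps could then proceed on the extended model. The hedged sketch below uses parameter-efficient LoRA adapters via the Hugging Face `peft` library; the model paths, hyperparameters, and the choice of LoRA itself are illustrative assumptions here, not a restatement of the paper's exact training recipe.

```python
# Hypothetical sketch of parameter-efficient fine-tuning after vocabulary expansion.
# Paths and hyperparameters are assumptions, not the project's released config.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model

model = LlamaForCausalLM.from_pretrained(
    "path/to/llama", torch_dtype=torch.float16
)
tokenizer = LlamaTokenizer.from_pretrained("path/to/chinese_llama_tokenizer")

# After extending the vocabulary, the embedding matrix must be resized to match
# the new tokenizer before any further training.
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension (assumption)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are updated
```

Training the adapters first on plain Chinese text (secondary pre-training) and then on Chinese instruction-response pairs mirrors the two-stage recipe summarized in the abstract, while keeping the trainable parameter count far below that of full fine-tuning.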