Instruction tuning is widely recognized as a key technique for building generalist language models, and it drew the attention of researchers and the public with the release of InstructGPT \cite{ouyang2022training} and ChatGPT (https://chat.openai.com/). Despite impressive progress in English-oriented large language models (\textbf{LLMs}), it remains under-explored whether English-based foundation LLMs, given well-designed instruction tuning, can perform as well on multilingual tasks as they do on English tasks, and how the corpora needed for such tuning can be constructed. To remedy this gap, we propose this project as an attempt to create a Chinese instruction dataset through a variety of methods adapted to the intrinsic characteristics of 4 sub-tasks. We collect around 200k Chinese instruction-tuning samples, which have been manually checked to guarantee high quality. We also summarize the existing English and Chinese instruction corpora and briefly describe some potential applications of the newly constructed Chinese instruction corpora.