Instruction tuning is widely recognized as a key technique for building generalist language models, and it has drawn the attention of both researchers and the public since the release of InstructGPT~\citep{ouyang2022training} and ChatGPT\footnote{\url{https://chat.openai.com/}}. Despite the impressive progress of English-oriented large language models (LLMs), it remains under-explored whether English-based foundation LLMs, given well-designed instruction tuning, can perform as well on multilingual tasks as they do on English tasks, and how to construct the corpora needed for such tuning. To remedy this gap, we propose this project as an attempt to build a Chinese instruction dataset through a variety of methods adapted to the intrinsic characteristics of four sub-tasks. We collect around 200k Chinese instruction tuning samples, which have been manually checked to guarantee high quality. We also summarize the existing English and Chinese instruction corpora and briefly describe some potential applications of the newly constructed Chinese instruction corpora. The resulting \textbf{C}hinese \textbf{O}pen \textbf{I}nstruction \textbf{G}eneralist (\textbf{COIG}) corpora are available on Hugging Face\footnote{\url{https://huggingface.co/datasets/BAAI/COIG}} and GitHub\footnote{\url{https://github.com/BAAI-Zlab/COIG}}, and will be continuously updated.