Instruction tuning is widely recognized as a key technique for building generalist language models, and it has attracted the attention of both researchers and the public with the release of InstructGPT~\citep{ouyang2022training} and ChatGPT\footnote{\url{https://chat.openai.com/}}. Despite the impressive progress of English-oriented large language models (LLMs), it remains under-explored whether English-based foundation LLMs can perform comparably on multilingual tasks as they do on English tasks given well-designed instruction tuning, and how the corpora required for such tuning can be constructed. To remedy this gap, we propose this project as an attempt to create a Chinese instruction dataset through various methods adapted to the intrinsic characteristics of four sub-tasks. We collect around 200k Chinese instruction-tuning samples, which have been manually checked to guarantee high quality. We also summarize the existing English and Chinese instruction corpora and briefly describe some potential applications of the newly constructed Chinese instruction corpora. The resulting \textbf{C}hinese \textbf{O}pen \textbf{I}nstruction \textbf{G}eneralist (\textbf{COIG}) corpora are available on Hugging Face\footnote{\url{https://huggingface.co/datasets/BAAI/COIG}} and GitHub\footnote{\url{https://github.com/FlagOpen/FlagInstruct}}, and will be continuously updated.