Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation

Yunjie Ji,Yan Gong,Yong Deng,Yiping Peng,Qiang Niu,Baochang Ma,Xiangang Li

Recently, significant public efforts have been directed towards developing low-cost models with capabilities akin to ChatGPT, thereby fostering the growth of open-source conversational models. However, there remains a scarcity of comprehensive and in-depth evaluations of these models' performance. In this study, we examine the influence of training data factors, including quantity, quality, and linguistic distribution, on model performance. Our analysis is grounded in several publicly accessible, high-quality instruction datasets, as well as our own Chinese multi-turn conversations. We assess various models using an evaluation set of 1,000 samples, encompassing nine real-world scenarios. Our goal is to supplement manual evaluations with quantitative analyses, offering valuable insights for the continued advancement of open-source chat models. Furthermore, to enhance both the performance and the training and inference efficiency of models in the Chinese domain, we extend the vocabulary of LLaMA (the open-source model whose performance is closest to that of proprietary language models such as GPT-3) and conduct secondary pre-training on 3.4B Chinese words. We make our model, data, and code publicly available.
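
The efficiency argument for extending the vocabulary can be illustrated with a toy example. The sketch below is not the authors' tokenizer: it assumes a simplified greedy longest-match tokenizer with a per-byte fallback, and the names `tokenize`, `BASE_VOCAB`, and `EXTENDED_VOCAB` are hypothetical. It only shows why adding Chinese pieces to a vocabulary can shorten the token sequence for Chinese text, which in turn reduces the work done per sentence during training and inference.

```python
from typing import List, Set


def tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match tokenization with a per-byte fallback (toy model)."""
    max_len = max((len(piece) for piece in vocab), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No piece matched: emit one token per UTF-8 byte of the character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies: ASCII letters only, then the same plus Chinese pieces.
BASE_VOCAB = set("abcdefghijklmnopqrstuvwxyz")
EXTENDED_VOCAB = BASE_VOCAB | {"中文", "指令", "模型"}

text = "中文指令模型"
base_tokens = tokenize(text, BASE_VOCAB)
extended_tokens = tokenize(text, EXTENDED_VOCAB)
print(len(base_tokens), len(extended_tokens))
```

In this toy setting every Chinese character falls back to three byte tokens under the base vocabulary, while the extended vocabulary covers the same text with one token per two-character word. Real subword tokenizers behave differently in detail, so the ratio here should not be read as a measurement of the actual model.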
