Recently, significant public efforts have been directed towards developing low-cost models with capabilities comparable to ChatGPT, fostering the growth of open-source conversational models. However, comprehensive and in-depth evaluations of these models' performance remain scarce. In this study, we examine how training data factors, including quantity, quality, and linguistic distribution, influence model performance. Our analysis is grounded in several publicly accessible, high-quality instruction datasets, as well as our own Chinese multi-turn conversations. We assess various models using an evaluation set of 1,000 samples spanning nine real-world scenarios. Our goal is to supplement manual evaluations with quantitative analyses, offering valuable insights for the continued advancement of open-source chat models. Furthermore, to improve model performance and training and inference efficiency in the Chinese domain, we extend the vocabulary of LLaMA, the open-source model whose performance is closest to that of proprietary language models such as GPT-3, and conduct secondary pre-training on 3.4B Chinese words. We make our model, data, and code publicly available.
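As a minimal sketch of what extending the LLaMA vocabulary and resizing the embeddings might look like, the following snippet uses Hugging Face `transformers`; the checkpoint name and the list of added tokens are placeholders, and the paper's actual procedure (e.g., merging a SentencePiece model trained on a Chinese corpus) may differ.

```python
# Hedged sketch: extend a LLaMA tokenizer with extra Chinese tokens and
# resize the model's embedding matrices before secondary pre-training.
from transformers import LlamaTokenizer, LlamaForCausalLM

base_model = "decapoda-research/llama-7b-hf"  # placeholder checkpoint name
tokenizer = LlamaTokenizer.from_pretrained(base_model)
model = LlamaForCausalLM.from_pretrained(base_model)

# Hypothetical list of new Chinese tokens; in practice these would come from
# a SentencePiece vocabulary trained on a large Chinese corpus.
new_tokens = ["中国", "模型", "训练"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the input/output embeddings so the new token ids are valid; the newly
# added rows are randomly initialized and learned during secondary pre-training.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")
```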