Over the last few years, large language models (LLMs) have emerged as one of the most important breakthroughs in natural language processing (NLP), fundamentally transforming research and development in the field. ChatGPT is one of the most exciting recently developed LLM systems, showcasing impressive language generation capabilities and attracting intense public attention. Beyond its many exciting applications in English, the model can process and generate text in multiple languages owing to its multilingual training data. Given the broad adoption of ChatGPT for English across different problems and areas, a natural question is whether ChatGPT can also be applied effectively to other languages, or whether it is necessary to develop more language-specific technologies. Answering this question requires a thorough evaluation of ChatGPT over multiple tasks with diverse languages and large datasets (i.e., beyond reported anecdotes), which is still missing or limited in current research. Our work aims to fill this gap in the evaluation of ChatGPT and similar LLMs to provide more comprehensive information for multilingual NLP applications. While this is an ongoing effort that will incorporate additional experiments in the future, our current paper evaluates ChatGPT on 7 different tasks, covering 37 diverse languages with high, medium, low, and extremely low resources. We also focus on the zero-shot learning setting for ChatGPT to improve reproducibility and better simulate the interactions of general users. Compared to the performance of previous models, our extensive experimental results demonstrate worse performance for ChatGPT across different NLP tasks and languages, calling for further research to develop better models and a deeper understanding of multilingual learning.
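To make the zero-shot setting concrete, the sketch below illustrates how a single evaluation query might be issued. This is a minimal sketch assuming the OpenAI Python SDK (v1 interface); the prompt template, the sentiment-classification task, the model name, and the helper function are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of one zero-shot evaluation query. Assumes the OpenAI
# Python SDK (>=1.0) is installed and OPENAI_API_KEY is set; the prompt
# template and task are hypothetical examples, not the paper's prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical zero-shot prompt template for a text classification task:
# the model receives only the instruction and input, with no in-context
# demonstrations.
PROMPT = (
    "Classify the sentiment of the following {language} text as "
    "positive, negative, or neutral.\n\nText: {text}\nLabel:"
)

def zero_shot_predict(text: str, language: str) -> str:
    """Query the model once in the zero-shot setting."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": PROMPT.format(language=language, text=text),
        }],
        temperature=0,  # greedy-like decoding to aid reproducibility
    )
    return response.choices[0].message.content.strip()

# Example: one Vietnamese sample from a hypothetical test set.
print(zero_shot_predict("Bộ phim này thật tuyệt vời!", "Vietnamese"))
```

Running such a query per test example and comparing the returned label against the gold annotation is the basic loop behind a zero-shot evaluation; temperature 0 is used here only to reduce run-to-run variance.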