Spurred by advancements in scale, large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot -- i.e., without adaptation on downstream data. Recently, the debut of ChatGPT has drawn a great deal of attention from the NLP community because it can generate high-quality responses to human input and self-correct previous mistakes based on subsequent conversations. However, it is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot. In this work, we empirically analyze the zero-shot learning ability of ChatGPT by evaluating it on 20 popular NLP datasets covering 7 representative task categories. With extensive empirical studies, we demonstrate both the effectiveness and limitations of the current version of ChatGPT. We find that ChatGPT performs well on many tasks favoring reasoning capabilities (e.g., arithmetic reasoning), while it still faces challenges when solving specific tasks such as sequence tagging. We additionally provide in-depth analysis through qualitative case studies.